
Case Study

Hearly Knowledge

RAG-powered knowledge infrastructure, a scalable memory layer for AI agents and chatbots.

sub-200ms

retrieval latency

1 npm pkg

@hearly-knowledgebase/widget

Paying users

in production

The Problem

LLM applications are only as useful as the context they can access. Vanilla chatbots hallucinate or return generic answers because they have no awareness of the specific knowledge a business holds, whether in its docs, meetings, support tickets, or internal wikis.

Existing solutions either lock you into a proprietary SaaS platform (expensive, no control) or require you to stitch together five different services yourself (LangChain + Pinecone + chunking lib + embedding service + your own API layer). I wanted a single, embeddable package that any developer could drop into their stack in under 10 minutes.

Architecture Overview

[Architecture diagram — Ingestion: Documents (PDF · MD · TXT), Audio/Video (transcription), Web/API (URL · webhook) → Chunker (overlap + size) → Embedder (OpenAI ada). Storage: pgvector (Postgres + vector extension, HNSW index, sub-200ms) + Metadata store (source, chunk ref, human edits). Retrieval: Query (user/agent) → ANN search (cosine similarity, top-k) → Context builder (re-rank + assemble) → LLM (GPT-4o completion) → Response (streamed, cited).]

Three layers: Ingestion → Storage → Retrieval
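The storage layer keeps vectors and relational metadata side by side. A sketch of what a stored chunk row might look like (field names here are illustrative, not the project's actual schema):

```typescript
// Illustrative shape of one row in the chunk store: the embedding
// (a pgvector column) lives next to the relational metadata that
// powers citations and human-in-the-loop edits.
interface ChunkRow {
  id: number;
  content: string;          // the chunk text itself
  embedding: number[];      // dense vector (pgvector column)
  source: string;           // e.g. a document path or URL
  chunkRef: string;         // position of the chunk within its source
  editedByHuman: boolean;   // flags manual corrections
}

// Example row (values are made up for illustration):
const row: ChunkRow = {
  id: 1,
  content: "Plans start at the Starter tier.",
  embedding: [0.12, -0.08, 0.33],
  source: "docs/pricing.md",
  chunkRef: "chunk-0",
  editedByHuman: false,
};
```

Because the metadata sits in the same row as the vector, a single query can return both the match and everything needed to cite it.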

How It Works

  1. Ingest

    Source content (PDF, audio, URLs, webhooks) enters through the ingestion API. Audio goes through a transcription step first.

  2. Chunk

    Content is split into overlapping chunks (configurable size + overlap) to preserve context at boundaries.

  3. Embed

    Each chunk is converted to a dense vector using OpenAI's text-embedding-ada-002 model.

  4. Store

    Vectors land in pgvector (Postgres extension). Metadata (source, chunk ref, human edits) is stored in the same DB.

  5. Retrieve

    At query time, the user's question is embedded and an ANN search (cosine similarity, HNSW index) finds the top-k relevant chunks.

  6. Rerank

    The top-k candidates are scored a second time against the query for relevance. This catches cases where vector similarity alone surfaces chunks that are topically close but not actually useful for answering the question.

  7. Respond

    The reranked chunks are assembled into a context window and passed to GPT-4o. The response is streamed back with source citations.
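The chunk → embed → store → retrieve path can be sketched end to end. This is a toy illustration only: the embedder below is a character-frequency stand-in for text-embedding-ada-002, and an in-memory array stands in for pgvector.

```typescript
// Step 2: overlapping chunks preserve context at boundaries.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// Step 3 (toy): character-frequency vector standing in for a real embedding.
function embed(text: string): number[] {
  const v = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) v[i] += 1;
  }
  return v;
}

// Step 5: cosine similarity, then top-k over the stored vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function retrieve(query: string, corpus: string[], k: number): string[] {
  const q = embed(query);
  return corpus
    .map((c) => ({ c, score: cosine(q, embed(c)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

In production the brute-force scan in `retrieve` is what the HNSW index replaces: an approximate nearest-neighbour search instead of scoring every chunk.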

Key Technical Decisions

pgvector over Pinecone

Keeping vectors in Postgres eliminates a managed-service dependency, lets me join vector results with relational metadata in a single query, and reduces operational surface area. HNSW indexing keeps retrieval under 200ms at the dataset sizes I'm targeting.
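A sketch of what that single-query join looks like. Table and column names are assumptions, not the actual schema; `<=>` is pgvector's cosine-distance operator, which the HNSW index serves.

```typescript
// Builds the retrieval SQL: ANN search and metadata come back in one
// round trip because vectors and metadata share a table. $1 is the
// query embedding, bound by the Postgres client at execution time.
function buildRetrievalQuery(topK: number): string {
  return `
SELECT c.content, c.source, c.chunk_ref,
       1 - (c.embedding <=> $1::vector) AS similarity
FROM chunks AS c
ORDER BY c.embedding <=> $1::vector
LIMIT ${topK};`.trim();
}
```

With a separate vector store like Pinecone, the same result needs two round trips: one to the vector service for IDs, one to the database for metadata.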

RAG over fine-tuning

Fine-tuned models bake knowledge into weights, so updating them requires a new training run. RAG keeps the knowledge layer separate from the model, so customers can edit or add content instantly without a re-deploy.
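A sketch of why updates are instant under RAG: an edit is just a row update plus a re-embed, visible on the next query. The embedder and in-memory map below are stand-ins for the real embedding call and Postgres.

```typescript
// In-memory stand-in for the chunk table.
const store = new Map<string, { content: string; embedding: number[] }>();

// Stand-in for the real embedding call (text-embedding-ada-002).
function embed(text: string): number[] {
  return Array.from(text).map((c) => c.charCodeAt(0) / 255);
}

// Editing knowledge = re-embed + upsert; no training run, no re-deploy.
function upsertChunk(id: string, content: string): void {
  store.set(id, { content, embedding: embed(content) });
}
```

The equivalent operation against a fine-tuned model would mean assembling a dataset, running a training job, and shipping new weights.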

Embeddable widget over iframe

An iframe is a black box: no style customisation, no event access, and it can be blocked outright by a site's Content Security Policy. A published npm package lets integrators style the widget to match their product and subscribe to query/response events for analytics.
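A minimal sketch of the event surface this enables. The class and event names below are illustrative, not the published package's actual API:

```typescript
// Hypothetical event surface for an embeddable knowledge widget:
// integrators subscribe to query/response events for their analytics.
type WidgetEvent = "query" | "response";

class KnowledgeWidget {
  private handlers: Record<WidgetEvent, Array<(payload: string) => void>> = {
    query: [],
    response: [],
  };

  on(event: WidgetEvent, fn: (payload: string) => void): void {
    this.handlers[event].push(fn);
  }

  ask(question: string): string {
    this.handlers.query.forEach((fn) => fn(question));
    const answer = `answer for: ${question}`; // stand-in for the real RAG call
    this.handlers.response.forEach((fn) => fn(answer));
    return answer;
  }
}
```

An iframe exposes none of this: events stay trapped inside the frame boundary unless both sides implement postMessage plumbing.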

What I Learned

  • Chunking strategy matters more than model choice. Poor chunk boundaries hurt retrieval quality regardless of which embedding model you use.
  • Metadata is first-class. Storing source references alongside vectors makes citations and human-in-the-loop editing possible. Without it, you have a black box.
  • Developer experience is a product decision. The npm package and 10-minute integration target forced me to simplify the API surface until it was obvious.
  • Shipping to paying customers early surfaces the real edge cases faster than any amount of internal testing.