Case Study
Hearly Knowledge
RAG-powered knowledge infrastructure, a scalable memory layer for AI agents and chatbots.
- sub-200ms retrieval latency
- 1 npm package: @hearly-knowledgebase/widget
- Paying users in production
The Problem
LLM applications are only as useful as the context they can access. Vanilla chatbots hallucinate or return generic answers because they have no awareness of the specific knowledge a business holds, whether in its docs, meetings, support tickets, or internal wikis.
Existing solutions either lock you into a proprietary SaaS platform (expensive, no control) or require you to stitch together five different services yourself (LangChain + Pinecone + chunking lib + embedding service + your own API layer). I wanted a single, embeddable package that any developer could drop into their stack in under 10 minutes.
Architecture Overview
Three layers: Ingestion → Storage → Retrieval
How It Works
1. Ingest: Source content (PDF, audio, URLs, webhooks) enters through the ingestion API. Audio goes through a transcription step first.
2. Chunk: Content is split into overlapping chunks (configurable size + overlap) to preserve context at boundaries.
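In TypeScript the chunker is only a few lines; the option names and defaults below are illustrative rather than the production values.

```ts
// Minimal overlapping chunker. Option names and defaults are illustrative.
interface ChunkOptions {
  size: number;    // characters per chunk
  overlap: number; // characters shared with the previous chunk
}

function chunkText(
  text: string,
  { size, overlap }: ChunkOptions = { size: 1000, overlap: 200 }
): string[] {
  const step = Math.max(1, size - overlap); // guard against overlap >= size
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end of the text
  }
  return chunks;
}
```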
3. Embed: Each chunk is converted to a dense vector using OpenAI's text-embedding-ada-002 model.
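A sketch of the embedding step using the official openai Node SDK (batching, retries, and rate-limit handling omitted):

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed a batch of chunks; text-embedding-ada-002 returns 1536-dimensional vectors.
async function embedChunks(chunks: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: chunks,
  });
  return res.data.map((d) => d.embedding);
}
```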
4. Store: Vectors land in pgvector (Postgres extension). Metadata (source, chunk ref, human edits) is stored in the same DB.
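A hypothetical schema sketch showing how a chunk's vector and its relational metadata share one table; the table and column names are assumptions for illustration, not the actual schema.

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings via PG* environment variables

// Hypothetical schema: each chunk row carries its vector and its relational metadata.
await pool.query(`
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    source_id bigint NOT NULL,          -- which doc/meeting/ticket the chunk came from
    content   text   NOT NULL,          -- the chunk text (editable by humans)
    embedding vector(1536) NOT NULL     -- text-embedding-ada-002 output
  );
  CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
`);

async function storeChunk(sourceId: number, content: string, embedding: number[]) {
  // pgvector accepts the '[0.1, 0.2, ...]' text form, which JSON.stringify produces.
  await pool.query(
    "INSERT INTO chunks (source_id, content, embedding) VALUES ($1, $2, $3)",
    [sourceId, content, JSON.stringify(embedding)]
  );
}
```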
5. Retrieve: At query time, the user's question is embedded and an approximate nearest-neighbour (ANN) search (cosine similarity, HNSW index) finds the top-k relevant chunks.
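Retrieval then becomes a single SQL query. This sketch reuses the pool, table, and embedChunks helper from the steps above; <=> is pgvector's cosine-distance operator, and the HNSW index turns the ORDER BY into an approximate scan.

```ts
async function retrieve(question: string, k = 8) {
  const [queryEmbedding] = await embedChunks([question]);
  const { rows } = await pool.query(
    `SELECT id, source_id, content,
            1 - (embedding <=> $1) AS similarity
       FROM chunks
      ORDER BY embedding <=> $1
      LIMIT $2`,
    [JSON.stringify(queryEmbedding), k]
  );
  return rows as { id: number; source_id: number; content: string; similarity: number }[];
}
```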
6. Rerank: The top-k candidates are scored a second time against the query for relevance. This catches cases where vector similarity alone surfaces chunks that are topically close but not actually useful for answering the question.
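The reranking model isn't specified here, so the sketch below uses a small LLM call as a second-pass relevance scorer, which is one common way to implement the step; the model choice and prompt are assumptions.

```ts
// Score each candidate against the question and keep the best few.
async function rerank(
  question: string,
  candidates: { id: number; source_id: number; content: string }[],
  topN = 4
) {
  const scored = await Promise.all(
    candidates.map(async (c) => {
      const res = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [
          {
            role: "user",
            content:
              `Question: ${question}\n\nPassage:\n${c.content}\n\n` +
              "On a scale of 0-10, how useful is this passage for answering the question? Reply with a single number.",
          },
        ],
      });
      const score = parseFloat(res.choices[0].message.content ?? "0");
      return { ...c, score: Number.isNaN(score) ? 0 : score };
    })
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```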
7. Respond: The reranked chunks are assembled into a context window and passed to GPT-4o. The response is streamed back with source citations.
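Tying steps 5-7 together, a hedged sketch that builds a numbered context block from the reranked chunks and streams a GPT-4o answer; the prompt wording and citation format are illustrative.

```ts
async function* answer(question: string) {
  // Retrieve, rerank, then format each chunk with a citation marker.
  const context = (await rerank(question, await retrieve(question)))
    .map((c, i) => `[${i + 1}] (source ${c.source_id}) ${c.content}`)
    .join("\n\n");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: [
      {
        role: "system",
        content: "Answer using only the numbered context passages and cite them like [1].",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  // Stream tokens back to the caller as they arrive.
  for await (const part of stream) {
    yield part.choices[0]?.delta?.content ?? "";
  }
}
```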
Key Technical Decisions
pgvector over Pinecone
Keeping vectors in Postgres eliminates a managed-service dependency, lets me join vector results with relational metadata in a single query, and reduces operational surface area. HNSW indexing keeps retrieval under 200ms at the dataset sizes I'm targeting.
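That single-query join looks roughly like this, where the sources table is a hypothetical extension of the schema sketched in the Store step:

```ts
// One round trip: ANN search plus a join onto relational metadata.
// queryEmbedding is the embedded question, as in the Retrieve step.
const { rows } = await pool.query(
  `SELECT c.content, s.title, s.url
     FROM chunks c
     JOIN sources s ON s.id = c.source_id
    ORDER BY c.embedding <=> $1
    LIMIT 8`,
  [JSON.stringify(queryEmbedding)]
);
```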
RAG over fine-tuning
Fine-tuned models bake knowledge into weights, so updating them requires a new training run. RAG keeps the knowledge layer separate from the model, so customers can edit or add content instantly without a re-deploy.
Embeddable widget over iframe
An iframe is a black box: no style customisation, no access to events, and it can be blocked outright by a restrictive Content Security Policy. A published npm package lets integrators style the widget to match their product and subscribe to query/response events for analytics.
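Roughly what that integration surface looks like; the package name is real, but the exported function, option names, and event names below are assumptions rather than the documented API.

```ts
import { createWidget } from "@hearly-knowledgebase/widget";

// The host app's own analytics client (assumed to exist).
declare const analytics: { track: (event: string, props: Record<string, unknown>) => void };

const widget = createWidget({
  apiKey: "hk_live_...",                        // hypothetical credential
  container: document.getElementById("help")!,  // renders inline, no iframe
  theme: { accentColor: "#4f46e5", fontFamily: "inherit" }, // style to match the host product
});

// Subscribe to query/response events for product analytics.
widget.on("query", (q: { text: string }) =>
  analytics.track("kb_query", { question: q.text })
);
widget.on("response", (r: { citations: unknown[] }) =>
  analytics.track("kb_response", { sources: r.citations.length })
);
```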
What I Learned
- Chunking strategy matters more than model choice. Poor chunk boundaries hurt retrieval quality regardless of which embedding model you use.
- Metadata is first-class. Storing source references alongside vectors makes citations and human-in-the-loop editing possible. Without it, you have a black box.
- Developer experience is a product decision. The npm package and 10-minute integration target forced me to simplify the API surface until it was obvious.
- Shipping to paying customers early surfaces the real edge cases faster than any amount of internal testing.