Shipping RAG to production without losing your sleep
An honest field report from building a clinical-grade retrieval system at Khynex — the eval harness mattered more than the model.
- AI
- RAG
- Engineering
Most retrieval-augmented generation tutorials end where production begins. They show a notebook that returns a plausible answer, declare victory, and never mention the day a regulator asks why your model said the wrong thing about a heart condition.
We just shipped one of these systems for a clinical customer. Here's what actually mattered.
Eval before architecture
We spent the first two weeks of the engagement building an eval harness, not a pipeline. A 1,200-question test set, hand-curated by clinicians, with explicit grading rubrics for citation accuracy, hallucination, and latency. Every change to the system — new chunker, new reranker, new model — runs through it.
If you can't measure it, you can't ship it. And in a regulated context, you cannot ship it without a paper trail.
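A harness like the one described can be boiled down to a gate that scores citation accuracy and latency against hard thresholds. This is a minimal sketch with hypothetical names (`Case`, `grade`, the stubbed system under test) — the real 1,200-question set, clinician rubrics, and hallucination grading are not shown here.

```python
# Minimal sketch of an eval CI gate. All names and thresholds are
# illustrative, not the production harness.
import time
from dataclasses import dataclass

@dataclass
class Case:
    question: str
    expected_citation: str  # doc id the answer must cite

def grade(answer_fn, cases, p95_budget_s=1.5):
    """Return pass rates for citation accuracy and a p95 latency check."""
    latencies, cited = [], 0
    for case in cases:
        start = time.perf_counter()
        _answer, citations = answer_fn(case.question)
        latencies.append(time.perf_counter() - start)
        cited += case.expected_citation in citations
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "citation_accuracy": cited / len(cases),
        "p95_latency_s": p95,
        "pass": cited == len(cases) and p95 <= p95_budget_s,
    }

# Stubbed system under test: always answers instantly and cites doc-1.
cases = [Case("What is the dosing interval?", "doc-1")]
report = grade(lambda q: ("...", ["doc-1"]), cases)
```

Wiring this into CI as a merge gate is what turns the harness into the paper trail the second paragraph asks for: every chunker or model change carries its scorecard.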
Hybrid retrieval, always
Pure vector search is a beautiful demo and a fragile production system. Hybrid BM25 + dense retrieval, with a learned reranker on top, was the only thing that survived adversarial queries from the clinical team.
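One common way to combine the two retrievers — not necessarily the exact fusion the team used — is reciprocal rank fusion, which merges the BM25 and dense ranked lists before the reranker sees them:

```python
# Reciprocal rank fusion sketch: score each doc by sum of 1/(k + rank)
# across both ranked lists. k=60 is the conventional default.
def rrf(bm25_ranked, dense_ranked, k=60):
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks well in both lists, so it fuses to the top.
fused = rrf(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

The appeal is that it needs no score calibration between the lexical and dense scorers; a learned reranker then reorders the fused shortlist.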
Latency budgets are engineering constraints
A 4-second answer is a useless answer in a clinical workflow. We held a hard 1.5s p95 budget end-to-end. That meant streaming, that meant a small reranker, and that meant pre-warming the embedding cache for common queries.
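Pre-warming the embedding cache can be as simple as running the known high-traffic queries through the embedder at deploy time. A sketch, with `embed` standing in for the real model call and the query list invented for illustration:

```python
# Sketch of pre-warming an embedding cache for common queries.
# embed() and COMMON_QUERIES are placeholders, not production code.
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed(query: str) -> tuple:
    # Stand-in for the (slow) embedding model call.
    return tuple(float(ord(c)) for c in query[:8])

COMMON_QUERIES = ["contraindications for warfarin", "maximum daily dose"]

def prewarm():
    for q in COMMON_QUERIES:
        embed(q)  # populate the cache before traffic arrives

prewarm()
hit_vector = embed(COMMON_QUERIES[0])  # served from cache, no model call
```

The point is that the cache lookup sits inside the 1.5s budget while the cold embedding call does not, so warming happens off the request path.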
The unglamorous work
Most of the team's time went into the parts of the system that don't appear in the architecture diagram: the chunking strategy, the citation post-processor, the eval CI gate, the human-in-the-loop review queue.
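To make the citation post-processor concrete: one simple, defensible policy is to drop any citation whose quoted span does not appear verbatim in the retrieved chunk. This is an illustrative sketch (the function and data shapes are assumptions, not the team's implementation):

```python
# Hypothetical citation post-processor: keep only citations whose quoted
# span appears verbatim in the retrieved source chunk.
def filter_citations(citations, chunks):
    """citations: list of (chunk_id, quoted_span); chunks: {chunk_id: text}."""
    kept = []
    for chunk_id, span in citations:
        source = chunks.get(chunk_id, "")
        if span and span in source:
            kept.append((chunk_id, span))
    return kept

chunks = {"c1": "Aspirin is contraindicated in patients with active bleeding."}
cites = [("c1", "contraindicated in patients"),  # real quote: kept
         ("c1", "safe for everyone")]            # fabricated: dropped
verified = filter_citations(cites, chunks)
```

Anything the filter drops is exactly what the human-in-the-loop review queue exists to catch.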
The model is the easy part. Everything around the model is the engagement.