Shipping RAG to production without losing your sleep
An honest field report from building a clinical-grade retrieval system at Khynex — the eval harness mattered more than the model.
- AI
- RAG
- Engineering
Most retrieval-augmented generation tutorials end where production begins. They show a notebook that returns a plausible answer, declare victory, and never mention the day a regulator asks why your model said the wrong thing about a heart condition.
We just shipped one of these systems for a clinical customer. Here's what actually mattered.
Eval before architecture
We spent the first two weeks of the engagement building an eval harness, not a pipeline. A 1,200-question test set, hand-curated by clinicians, with explicit grading rubrics for citation accuracy, hallucination, and latency. Every change to the system — new chunker, new reranker, new model — runs through it.
If you can't measure it, you can't ship it. And in a regulated context, you cannot ship it without a paper trail.
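A harness like the one described can be boiled down to a gate that scores citation accuracy and latency against hard thresholds. This is a minimal sketch with hypothetical names (`Case`, `grade`, the stubbed system under test) — the real 1,200-question set, clinician rubrics, and hallucination grading are not shown here.

```python
# Minimal sketch of an eval CI gate. All names and thresholds are
# illustrative, not the production harness.
import time
from dataclasses import dataclass

@dataclass
class Case:
    question: str
    expected_citation: str  # doc id the answer must cite

def grade(answer_fn, cases, p95_budget_s=1.5):
    """Return pass rates for citation accuracy and a p95 latency check."""
    latencies, cited = [], 0
    for case in cases:
        start = time.perf_counter()
        _answer, citations = answer_fn(case.question)
        latencies.append(time.perf_counter() - start)
        cited += case.expected_citation in citations
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {
        "citation_accuracy": cited / len(cases),
        "p95_latency_s": p95,
        "pass": cited == len(cases) and p95 <= p95_budget_s,
    }

# Stubbed system under test: always answers instantly and cites doc-1.
cases = [Case("What is the dosing interval?", "doc-1")]
report = grade(lambda q: ("...", ["doc-1"]), cases)
```

Wiring this into CI as a merge gate is what turns the harness into the paper trail the second paragraph asks for: every chunker or model change carries its scorecard.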
Hybrid retrieval, always
Pure vector search is a beautiful demo and a fragile production system. Hybrid BM25 + dense retrieval, with a learned reranker on top, was the only thing that survived adversarial queries from the clinical team.
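One common way to combine the two retrievers — not necessarily the exact fusion the team used — is reciprocal rank fusion, which merges the BM25 and dense ranked lists before the reranker sees them:

```python
# Reciprocal rank fusion sketch: score each doc by sum of 1/(k + rank)
# across both ranked lists. k=60 is the conventional default.
def rrf(bm25_ranked, dense_ranked, k=60):
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks well in both lists, so it fuses to the top.
fused = rrf(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

The appeal is that it needs no score calibration between the lexical and dense scorers; a learned reranker then reorders the fused shortlist.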
Latency budgets are engineering constraints
A 4-second answer is a useless answer in a clinical workflow. We held a hard 1.5s p95 budget end-to-end. That meant streaming, that meant a small reranker, and that meant pre-warming the embedding cache for common queries.
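Pre-warming the embedding cache can be as simple as running the known high-traffic queries through the embedder at deploy time. A sketch, with `embed` standing in for the real model call and the query list invented for illustration:

```python
# Sketch of pre-warming an embedding cache for common queries.
# embed() and COMMON_QUERIES are placeholders, not production code.
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed(query: str) -> tuple:
    # Stand-in for the (slow) embedding model call.
    return tuple(float(ord(c)) for c in query[:8])

COMMON_QUERIES = ["contraindications for warfarin", "maximum daily dose"]

def prewarm():
    for q in COMMON_QUERIES:
        embed(q)  # populate the cache before traffic arrives

prewarm()
hit_vector = embed(COMMON_QUERIES[0])  # served from cache, no model call
```

The point is that the cache lookup sits inside the 1.5s budget while the cold embedding call does not, so warming happens off the request path.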
The unglamorous work
Most of the team's time went into the parts of the system that don't appear in the architecture diagram: the chunking strategy, the citation post-processor, the eval CI gate, the human-in-the-loop review queue.
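To make the citation post-processor concrete: one simple, defensible policy is to drop any citation whose quoted span does not appear verbatim in the retrieved chunk. This is an illustrative sketch (the function and data shapes are assumptions, not the team's implementation):

```python
# Hypothetical citation post-processor: keep only citations whose quoted
# span appears verbatim in the retrieved source chunk.
def filter_citations(citations, chunks):
    """citations: list of (chunk_id, quoted_span); chunks: {chunk_id: text}."""
    kept = []
    for chunk_id, span in citations:
        source = chunks.get(chunk_id, "")
        if span and span in source:
            kept.append((chunk_id, span))
    return kept

chunks = {"c1": "Aspirin is contraindicated in patients with active bleeding."}
cites = [("c1", "contraindicated in patients"),  # real quote: kept
         ("c1", "safe for everyone")]            # fabricated: dropped
verified = filter_citations(cites, chunks)
```

Anything the filter drops is exactly what the human-in-the-loop review queue exists to catch.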
The model is the easy part. Everything around the model is the engagement.