▣ production · HIPAA · 4 yrs

Agentic RAG in regulated healthcare

Production agentic RAG over docs, code, Confluence, and Jira for a HIPAA/ISO 13485 platform — compliance retrieval 30s → sub-second, verification 60% faster.

Period: Sept 2021 – July 2025
Role: Senior Software Engineer · Treatment Technologies & Insights
Status: case study
  • 30s → <1s compliance retrieval
  • 60% faster verification
  • #rag
  • #llm-evals
  • #mcp
  • #healthcare
  • #fastapi
  • #qdrant

The problem

Treatment Technologies & Insights builds healthcare software, which means HIPAA and ISO 13485 — every release carries a burden of evidence. Does this requirement trace to a design doc? Did the code change ship with its test? Was the ticket closed before release? That evidence lived in four systems: product documentation, the codebase, Confluence, and Jira.

Verification was manual cross-referencing across those silos. A single documentation lookup took around 30 seconds, and one verification pass chains many lookups. Worse, the real questions were multi-hop — “which SOP covers this validation step, and is the implementing ticket actually closed?” — and no single search index can answer a question whose second half depends on the first.

The architecture

I architected an agentic RAG system instead of a classic one-shot retrieve-then-generate pipeline. The agent has MCP tools for each evidence source — documentation, codebase, Confluence, Jira — and does multi-hop reasoning: query one source, use the result to choose the next call, chain until the question is resolved. Provenance is preserved per hop, so every answer can say which system the evidence came from.
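The multi-hop loop with per-hop provenance can be sketched in a few lines. This is an illustrative sketch, not the production code: the tool functions stand in for the MCP tools, `plan_next` is a toy rule where the real system would let the LLM choose the next call, and the SOP number and ticket key are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evidence:
    source: str  # system of record this hop queried
    query: str   # what was asked of that system
    result: str  # what it returned


# Hypothetical stand-ins for two of the per-source tools (made-up data).
def search_confluence(query: str) -> str:
    return "SOP-112 covers validation step V-7; implemented by PLAT-431"


def get_jira_status(ticket_key: str) -> str:
    return f"{ticket_key}: Closed"


TOOLS: dict[str, Callable[[str], str]] = {
    "confluence": search_confluence,
    "jira": get_jira_status,
}


def plan_next(question: str, trail: list[Evidence]) -> Optional[tuple[str, str]]:
    """Toy planner: pick the next (tool, query) from the evidence so far."""
    if not trail:
        return ("confluence", question)
    if trail[-1].source == "confluence":
        # The second hop depends on the first: pull the ticket key
        # (last token of the toy result) out of the Confluence answer.
        ticket_key = trail[-1].result.rsplit(" ", 1)[-1]
        return ("jira", ticket_key)
    return None  # question resolved


def answer(question: str, max_hops: int = 4) -> list[Evidence]:
    """Chain tool calls until the planner stops, recording every hop."""
    trail: list[Evidence] = []
    for _ in range(max_hops):
        step = plan_next(question, trail)
        if step is None:
            break
        source, query = step
        trail.append(Evidence(source, query, TOOLS[source](query)))
    return trail


trail = answer("Which SOP covers validation step V-7, and is its ticket closed?")
for hop in trail:
    print(f"[{hop.source}] {hop.query!r} -> {hop.result}")
```

Because the trail is the return value, the final answer can cite each system it touched instead of a blended score.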

Underneath sits the retrieval layer: 768-dimensional embeddings in Qdrant, hybrid BM25 + vector retrieval, similarity-threshold tuning, and chunking strategies tuned for compliance documents. Retrieval quality was validated with NDCG and MRR on labeled query sets before anything reached production.
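One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). The write-up does not say which fusion method was used, so treat RRF, the constant `k = 60`, and the document ids below as assumptions for illustration.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings: BM25 favors the exact identifier match,
# the vector search favors a paraphrased work instruction.
bm25_ranking = ["sop-112", "sop-090", "wi-031"]
vector_ranking = ["wi-031", "sop-112", "sop-204"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists rise to the top, which is the behavior a mix of exact identifiers and paraphrased text needs.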

Serving ran through async FastAPI inference servers that routed LLM requests, with request batching, connection pooling, rate limiting, and failover across OpenAI, Anthropic, and Google.
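The failover part of that layer reduces to trying providers in priority order with a per-call timeout. The sketch below uses only `asyncio`; the provider functions are hypothetical stubs rather than real SDK calls, and batching, pooling, and rate limiting are left out.

```python
import asyncio
from typing import Awaitable, Callable

Provider = Callable[[str], Awaitable[str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider errored or timed out."""


async def complete_with_failover(
    prompt: str,
    providers: list[tuple[str, Provider]],
    timeout_s: float = 10.0,
) -> tuple[str, str]:
    """Return (provider_name, text) from the first provider that succeeds."""
    errors: list[str] = []
    for name, call in providers:
        try:
            text = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, text
        except Exception as exc:  # includes timeouts raised by wait_for
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for real provider clients.
async def flaky_primary(prompt: str) -> str:
    raise ConnectionError("simulated provider incident")


async def healthy_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


name, text = asyncio.run(
    complete_with_failover(
        "ping", [("primary", flaky_primary), ("secondary", healthy_secondary)]
    )
)
print(name, text)
```

A failed primary costs one extra round trip, so the caller sees added latency rather than an error.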

Model selection went through an evaluation framework I designed and deployed. It benchmarked ChatGPT, Claude, Gemini, Llama, and Qwen on p50/p95 latency, cost per token, accuracy, and capability fit for healthcare compliance workflows.
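For one model, the latency, cost, and accuracy columns of such a benchmark can be computed as below. This is a sketch under assumptions: it uses the nearest-rank percentile definition, reports cost per 1,000 tokens for readability, and every number in it is made up rather than taken from the real benchmark.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(
    latencies_ms: list[float],
    cost_usd: float,
    tokens: int,
    correct: int,
    total: int,
) -> dict[str, float]:
    """Collapse one model's benchmark run into comparable metrics."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "usd_per_1k_tokens": 1000 * cost_usd / tokens,
        "accuracy": correct / total,
    }


# Made-up run: ten requests, one slow outlier.
report = summarize(
    [120, 80, 100, 400, 90, 110, 95, 105, 85, 130],
    cost_usd=0.03,
    tokens=15_000,
    correct=9,
    total=10,
)
print(report)
```

The single 400 ms outlier sets p95 while leaving p50 at 100 ms, which is why the two are worth reporting side by side.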

Decisions that mattered

Tools over one giant index. The tempting design is to crawl everything into a single vector store. That dies on multi-hop questions and destroys provenance — “the index said so” doesn’t survive an audit. Giving the agent per-source tools kept every answer traceable to a system of record, which in this domain is the feature.

Hybrid BM25 + vector. Compliance text is dense with exact identifiers — SOP numbers, ticket keys, error codes. Pure semantic search loses exact-match precision; pure keyword search loses paraphrases. Hybrid retrieval with tuned similarity thresholds fixed both, and below threshold the system reports that it found no evidence instead of returning the nearest weak match. In a regulated environment, a confident wrong answer is strictly worse than no answer.
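The abstain-below-threshold rule is simple to state in code. In this sketch the `Hit` type, the 0.75 cutoff, and the message text are illustrative assumptions; in practice the threshold would be tuned per corpus.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # fused relevance score, higher is better


NO_EVIDENCE = "No evidence found above the confidence threshold."


def select_evidence(hits: list[Hit], min_score: float) -> list[Hit]:
    """Keep only hits at or above the threshold, best first."""
    kept = [hit for hit in hits if hit.score >= min_score]
    return sorted(kept, key=lambda hit: hit.score, reverse=True)


def respond(hits: list[Hit], min_score: float = 0.75) -> str:
    """Report evidence, or say explicitly that none cleared the bar."""
    kept = select_evidence(hits, min_score)
    if not kept:
        return NO_EVIDENCE  # abstain rather than surface the nearest weak match
    return "Evidence: " + ", ".join(hit.doc_id for hit in kept)


print(respond([Hit("sop-112", 0.91), Hit("wi-031", 0.40)]))
print(respond([Hit("wi-031", 0.40)]))
```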

Retrieval measured separately from generation. NDCG and MRR were computed against labeled queries and tracked as their own metrics. Nearly every bad answer traced back to retrieval, not the model, and you can’t fix what you’ve blended into one end-to-end score.
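Both metrics are small enough to compute by hand. The sketch below uses the linear-gain form of DCG (some variants use 2^grade - 1 instead) and made-up labels, so it illustrates the definitions rather than the actual evaluation harness.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank over (ranking, relevant-set) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids: list[str], grades: dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up graded labels for one query (3 = exact SOP, 1 = loosely related).
grades = {"sop-112": 3, "wi-031": 2, "sop-090": 1}
print(ndcg_at_k(["sop-112", "wi-031", "sop-090"], grades, k=3))  # ideal order
print(ndcg_at_k(["sop-090", "wi-031", "sop-112"], grades, k=3))  # reversed order
print(mrr([(["x", "sop-112"], {"sop-112"}), (["wi-031"], {"wi-031"})]))
```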

Evals over model loyalty. The five-model benchmark turned model choice into an engineering decision instead of a preference. Latency-sensitive interactive flows and accuracy-sensitive compliance flows don’t want the same model, and p50/p95 plus cost-per-token made those trade-offs explicit and revisitable as new models shipped.

No single-provider dependency. Rate limiting, batching, and failover across three providers in the FastAPI inference layer meant a provider incident degraded latency, not availability.

Numbers

  • Compliance documentation retrieval: 30 s → sub-second
  • Compliance verification workflows: 60% faster
  • 4 evidence sources in the agent loop: docs, codebase, Confluence, Jira
  • 768-dimensional embeddings in Qdrant; hybrid BM25 + vector retrieval
  • Retrieval quality gated on NDCG/MRR before production
  • 5 model families benchmarked on p50/p95 latency, cost-per-token, and accuracy; failover across 3 providers

Lessons

Retrieval is most of RAG quality. The glamour is in the agent loop; the wins came from threshold tuning, chunking, and hybrid scoring — measured, not vibed.

Agents earn their complexity only when questions are genuinely multi-hop. For single-lookup queries, a plain pipeline was fine; the agent paid for itself on the chained questions humans previously resolved by hand, which is exactly where the 60% verification speedup came from.

In regulated software, “show your sources” isn’t UX polish — it’s the requirement. Designing provenance into the architecture from the start is far cheaper than bolting citations onto a blended index later.

And four years of running this in production taught me that the unglamorous serving layer — batching, pooling, rate limits, provider failover — is what turns an AI feature into an actual system.