Case studies

Shipped, monitored, guaranteed.

Every project listed here is live, handling real users, monitored by Preflight.

Healthcare

Health coach for a weight management startup

A RAG-grounded coaching agent aligned with the client's nutritionists — answering from approved protocols only, matching each nutritionist's communication style, handling 2,000+ conversations daily.

  0        Hallucinations to users
  2,041    Daily conversations
  97.3%    Output adherence
Problem

Coaching doesn't scale.

The client is a health-coaching company with 60+ nutritionists managing weight-loss programs across India and the Middle East. Each patient gets a personalised diet plan, WhatsApp check-ins, and ongoing adjustments. The model works — retention is high, outcomes are strong.

The problem: each nutritionist can handle about 40 active patients. At that ratio, scaling means hiring proportionally. A team of 60 covers 2,400 patients. To reach 10,000, they'd need 250 nutritionists — and the hiring pipeline doesn't move that fast.

They'd tried a basic chatbot before. It lasted two weeks. The bot gave generic advice ("eat more vegetables"), ignored patient history, and on one occasion suggested a meal plan that conflicted with a patient's medication. The nutritionists pulled the plug.

"We don't need a chatbot. We need something that answers exactly the way Dr. Mehra would answer — and never, ever goes off-script on medical advice."
— Head of Product, client company
Approach

RAG with a structured context window.

We designed a retrieval layer that ensures the bot only answers from approved content. No general knowledge, no improvisation. Every response traces back to a source document that the client's medical team has signed off on.

The context window for each patient conversation is explicitly structured:

  • Past conversations — to match the assigned nutritionist's tone
  • Medical history and key risks
  • Current prescriptions
  • Weight and body metrics (weekly updates)
  • Goals and time horizon
  • Previous recommendations and adherence notes

Knowledge sources include: the client's proprietary diet protocols, a regional diet library covering South Asian, Middle Eastern, and Mediterranean cuisines, approved FAQ responses, and escalation SOPs for medical situations the bot should never handle.
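The structured context window above can be sketched as a small assembly step: patient state first, then only the retrieved, approved passages. All field and function names here are illustrative assumptions, not the client's actual schema.

```python
# Sketch of the structured context window described above.
# Field and function names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class PatientContext:
    past_conversations: list[str]       # used to match the nutritionist's tone
    medical_history: list[str]          # key risks
    prescriptions: list[str]
    body_metrics: dict[str, float]      # weekly weight updates, etc.
    goals: str
    adherence_notes: list[str] = field(default_factory=list)

def build_context_window(ctx: PatientContext, retrieved_protocols: list[str]) -> str:
    """Assemble the prompt in a fixed order: patient state sections first,
    then ONLY the approved protocol passages the retriever returned."""
    sections = [
        "## Medical history\n" + "\n".join(ctx.medical_history),
        "## Current prescriptions\n" + "\n".join(ctx.prescriptions),
        "## Metrics\n" + "\n".join(f"{k}: {v}" for k, v in ctx.body_metrics.items()),
        "## Goals\n" + ctx.goals,
        "## Approved sources (answer ONLY from these)\n" + "\n".join(retrieved_protocols),
    ]
    return "\n\n".join(sections)
```

Keeping the approved-source block last and explicitly labelled makes it easy for a downstream grounding check to verify that every claim traces back to that section.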

Stack: GPT-4 Turbo · Pinecone · text-embedding-3-large · WhatsApp Business API · Preflight
Eval strategy

What Preflight monitors on every response.

The core risk isn't hallucination in the traditional sense — it's scope creep. The bot knows a lot about nutrition and will confidently answer questions about medication, exercise physiology, or medical conditions if you let it. That's the failure mode Preflight is designed to catch.

Preflight — Health coach · Live

  97.3%   Adherence
  47      Blocked (30d)
  0.91    Quality judge
  1.2s    P95 latency

  Source grounding (NLI)             142/146 claims       Pass
  Scope boundary enforcement         Diet-only            Pass
  Tone match (persona: Dr. Mehra)    Consistent           Pass
  Medication mention detection       3 caught / 30d       Active
  Escalation SOP compliance          All paths covered    Pass

In the first 30 days, Preflight blocked 47 responses — mostly medication-related scope violations where the LLM tried to be helpful about drug interactions. None of those responses reached patients. The model is not perfect. The system is.

Results

The numbers after 90 days.

  0         Hallucinations to users
  2,041     Daily conversations
  40 → 120  Patients per nutritionist
  97.3%     Output adherence

The bot handles the first 2-3 turns of most conversations — answering diet questions, logging meals, adjusting portions based on weekly weigh-ins. When a patient asks something outside scope (medication, exercise injury, emotional distress), the bot escalates to the human nutritionist with full context.

Each nutritionist now manages 120 patients instead of 40. The company is scaling to 10,000 patients without proportional hiring. Nutritionists spend their time on complex cases, not on answering "can I eat rice at dinner?"

Life sciences

Drug research assistant for a US pharma startup

A multi-agent research tool that searches scientific literature, extracts reagent tables, and drafts study designs — cross-checking every claim against multiple sources.

  85%     Researcher time saved
  40m     vs 3-day review
  100%    Claims source-traced
Problem

Literature review is the bottleneck.

The client is a pre-clinical stage pharma startup with a team of 8 researchers. Before any experiment, someone has to review the existing literature — find relevant papers, extract methods and reagent details, identify conflicting findings, and draft the study design rationale. This process takes 2-3 days per research question. The team runs 3-4 questions per week. That means 40% of their research capacity goes to reading papers, not running experiments.

They'd tried ChatGPT directly. The outputs were fluent but unreliable — fabricated citations, hallucinated reagent catalogue numbers, confidently wrong dosing information. In drug research, a single wrong data point doesn't just waste time. It can derail a $200K experiment.

Approach

Three specialised agents working in sequence.

We built a multi-agent pipeline with three role-specialised agents, coordinated through our orchestration platform:

Agent 1 · Literature scanner

Takes a research question, searches PubMed and the client's internal knowledge base, retrieves relevant papers, and produces a ranked shortlist with relevance scores. Cross-references across databases to catch retracted or superseded studies.

Agent 2 · Data extractor

Reads the shortlisted papers and extracts structured data — reagent names, catalogue IDs, concentrations, conditions, vendor options, and approximate pricing. Outputs a normalised table aligned with the client's preferred vendors and procurement rules.

  Reagent      Catalogue ID   Est. price
  DMEM/F-12    11320033       $42/500ml
  FBS          A3160801       $380/500ml
  Matrigel     354234         $290/10ml
  Y-27632      SCM075         $185/5mg
Agent 3 · Synthesis writer

Takes the literature overview and extracted data, then produces a structured output: concise topic overview, key findings with limitations flagged, open questions, and a candidate study design with controls, arms, and sample size rationale. Everything is explicitly marked as requiring human PI validation.
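The three-stage pipeline can be reduced to a simple sequential sketch. Each "agent" here is a plain stub function; the real system wires LLM calls, PubMed search, and vendor lookups behind these interfaces, and every name below is illustrative.

```python
# Toy version of the three-agent pipeline (stubs stand in for LLM/API calls).

def scan_literature(question: str) -> list[dict]:
    # Agent 1: would search PubMed + the internal KB and rank results.
    return [{"title": f"Paper on {question}", "doi": "10.0000/example", "score": 0.92}]

def extract_data(papers: list[dict]) -> list[dict]:
    # Agent 2: would read the shortlist and normalise reagent tables.
    return [{"reagent": "DMEM/F-12", "catalogue_id": "11320033",
             "source_doi": papers[0]["doi"]}]

def synthesise(question: str, papers: list[dict], table: list[dict]) -> dict:
    # Agent 3: drafts the study design; output is always flagged for PI review.
    return {"question": question, "n_sources": len(papers),
            "reagents": table, "requires_pi_validation": True}

def run_pipeline(question: str) -> dict:
    papers = scan_literature(question)
    table = extract_data(papers)
    return synthesise(question, papers, table)
```

The sequential hand-off matters: Agent 2 only ever sees papers that Agent 1 shortlisted, so every extracted value carries a source DOI that downstream checks can verify.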

Stack: Multi-agent · PubMed API · GPT-4 Turbo · Pinecone · Preflight
Eval strategy

Every claim grounded. Every citation verified.

In pharma, the cost of a wrong answer is an order of magnitude higher than in most domains. A hallucinated reagent catalogue number means ordering the wrong chemical. A fabricated citation means building a study design on research that doesn't exist.

Preflight — Drug research assistant · Live

  100%    Citations verified
  12      Blocked (30d)
  0.93    Quality judge
  8.4s    Avg pipeline latency

  Citation existence (PubMed DOI)    All verified            Pass
  Catalogue ID validation            Cross-ref vendor API    Pass
  Claim-to-source grounding (NLI)    All traced              Pass
  Retraction check                   0 retracted cited       Pass
  Human validation flags             All marked              Pass

In the first month, Preflight caught 12 issues — 8 were catalogue IDs that had been discontinued or superseded by the vendor, 3 were citations where the DOI resolved to a different paper than described, and 1 was a retracted study. All were blocked before reaching the researcher's output.
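The DOI-mismatch failure mode caught above (a DOI that resolves, but to a different paper than described) suggests a check like the one below: resolve the DOI and fuzzy-match the resolved title against the claimed one. The use of the public Crossref API is an assumption; the case study only says DOIs are verified against PubMed.

```python
# Sketch of a citation-existence + title-match check.
# Crossref endpoint usage is an assumption, not the system's actual source.
import difflib
import json
import urllib.request

def title_matches(claimed: str, resolved: str, threshold: float = 0.8) -> bool:
    """Fuzzy-match so punctuation/casing differences don't cause false blocks."""
    ratio = difflib.SequenceMatcher(None, claimed.lower(), resolved.lower()).ratio()
    return ratio >= threshold

def verify_doi(doi: str, claimed_title: str) -> bool:
    """Block if the DOI is unresolvable or resolves to a different paper."""
    try:
        with urllib.request.urlopen(f"https://api.crossref.org/works/{doi}",
                                    timeout=10) as resp:
            meta = json.load(resp)["message"]
    except Exception:
        return False  # unresolvable DOI -> block the citation
    resolved = (meta.get("title") or [""])[0]
    return title_matches(claimed_title, resolved)
```

Treating "DOI exists" and "DOI matches the described paper" as separate checks is what distinguishes this from a naive existence lookup.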

Results

The numbers after 60 days.

  85%     Researcher time saved
  40m     Per research question
  100%    Citations verified
  0       Wrong data reached researchers

What used to take 2-3 days now takes 40 minutes. Researchers still review every output — the agent generates, it doesn't decide — but the review is now "check the structured summary" rather than "read 30 papers and build the table from scratch."

The team went from 3-4 research questions per week to 12-15. Their experimental throughput hasn't tripled because of AI — it's tripled because their researchers spend time designing experiments instead of reading papers.

"The real value isn't speed. It's that I trust the output. Every catalogue number checks out. Every citation is real. That's what the last tool couldn't do."

Got an agent that needs to pass?

Run your agent through our eval suite. See what breaks, what passes, and what it takes to fix it.

Probe your agent — free →