Fine-tuning a clinical AI model to frontier parity
Dr. Tom Kelly
Co-founder & CEO • June 15, 2026 • 8 min read
Why bigger isn't always better in clinical AI
You don't need frontier scale to reach frontier quality. You need a reward signal that's yours alone, and a tight loop to learn from it. Six weeks ago, we started replacing the best frontier model running in Heidi Evidence with a model of our own, a fraction of its size. On blind side-by-side evaluation, it has already reached parity, to the point where clinicians can no longer tell which is which.
This post is about how we got there, what the result does and doesn't cover, and why we think the pattern generalizes beyond our own use at Heidi.
The signal only clinicians can give
Evidence is Heidi's clinical search product, free to use outside of a patient session. A clinician asks a question and gets an answer grounded in real sources. Evidence has answered more than 3.5 million questions since launch. The value isn't in the volume of questions, though. It's that Evidence answers are backed by something the general-purpose labs can't buy: a real clinician telling us which of two responses was better. That preference is the signal we train on.
How we measure
We grade our models the way clinicians experience them, in a blinded side-by-side test we call SBS. Two answers to the same real Evidence question are shown side by side; the clinician doesn't know which model produced which answer and picks the one they prefer. A 50% win rate means parity: two answers from two models that a clinician can't distinguish in quality. We run blinded SBS evaluations before each model update.
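To make the parity reading concrete, here is a minimal sketch of how a win rate and its uncertainty could be computed from SBS votes. It assumes each comparison ends in a single pick with no ties, and it uses a Wilson score interval, which is a standard choice rather than anything we've described above. The function names and the vote counts in the usage note are illustrative only.

```python
import math


def sbs_win_rate(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win_rate, lower, upper) with a Wilson score interval.

    wins  -- comparisons where the candidate model's answer was picked
    total -- all blinded comparisons (assumes no ties or skips)
    z     -- normal quantile; 1.96 gives an approximate 95% interval
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")

    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, center - half, center + half


def consistent_with_parity(wins: int, total: int) -> bool:
    """True when 50% falls inside the interval around the observed win rate."""
    _, lower, upper = sbs_win_rate(wins, total)
    return lower <= 0.5 <= upper
```

For example, `consistent_with_parity(499, 1000)` returns `True`, while `consistent_with_parity(400, 1000)` returns `False`. The counts are made up for illustration; the width of the interval depends entirely on how many comparisons were run.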
Preference isn't the whole bar, though. A side-by-side captures the average and can miss the tails, so a model only counts as better once it has cleared a separate safety and quality check, run offline against curated test sets like the public HealthBench Pro and our own Heidi Medical QA. Those catch what preference voting waves through, like hallucinations, weak instruction-following, and template drift. We hold safety as the top priority, then optimize for the answer clinicians prefer. Alongside the benchmarks, we track clinician feedback, thumbs up and down, in production. A model only graduates once it clears all three: blinded preference, the safety bar, and real-world use.
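The three-part bar can be read as a simple gate. The sketch below is an illustration, not our release tooling: `CandidateReport`, `graduates`, and both thresholds are hypothetical placeholders. The one thing it takes from the text is the ordering, with safety checked first and preference second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateReport:
    sbs_win_rate: float        # blinded side-by-side win rate vs. the incumbent
    safety_suite_passed: bool  # offline safety and quality check on curated test sets
    thumbs_up_rate: float      # share of positive clinician feedback in production


def graduates(
    report: CandidateReport,
    min_win_rate: float = 0.49,        # placeholder threshold, not a published value
    min_thumbs_up_rate: float = 0.80,  # placeholder threshold, not a published value
) -> bool:
    """Return True only when all three checks clear, with safety evaluated first."""
    if not report.safety_suite_passed:
        # A strong preference score never compensates for a failed safety check.
        return False
    if report.sbs_win_rate < min_win_rate:
        return False
    return report.thumbs_up_rate >= min_thumbs_up_rate
```

A candidate with a high win rate but a failed safety suite is rejected before preference is even considered, which is the point of putting that check first.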
The result
On out-of-session Evidence, our fine-tuned model now reaches a 49.9% SBS win rate against the frontier model it replaced, Sonnet 4.6. Two honest limits are worth stating plainly:
Scope: this covers out-of-session Evidence, the search and reasoning a clinician does outside a live visit. It does not yet cover the in-session work where Evidence reaches into a patient's context and takes action.
Timing: the model is rolling into production now rather than already serving every query.
The method
Evidence is the hardest model we've fine-tuned and our first agentic one. Where a scribe model summarizes what it's given, the Evidence model has to decide which source to pull, whether to keep searching, and when it has enough to answer. There's no single rule for the right moment to stop, so the model has to calibrate its own uncertainty, which is what should trigger a search in the first place. That makes it closer to a long-horizon reasoning problem than to next-token prediction, and roughly an order of magnitude harder than our summarization models. Reaching parity here, on our own model, is the real step-up.
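A deliberately simplified sketch of that control flow is below. In the real model the stopping decision is learned, not set by a fixed rule, so the explicit `threshold` and `max_searches` here are stand-ins, and `search` and `confidence` are hypothetical callables supplied by the caller.

```python
from typing import Callable


def gather_evidence(
    question: str,
    search: Callable[[str, list[str]], str],        # hypothetical: returns one new source
    confidence: Callable[[str, list[str]], float],  # hypothetical: estimate in [0, 1]
    threshold: float = 0.8,   # stand-in for a learned stopping decision
    max_searches: int = 5,    # hard cap so the loop always terminates
) -> list[str]:
    """Keep searching while the confidence estimate is below the threshold."""
    evidence: list[str] = []
    while len(evidence) < max_searches and confidence(question, evidence) < threshold:
        evidence.append(search(question, evidence))
    return evidence
```

The sketch shows why calibration matters: if `confidence` is overconfident, the loop stops before it has the sources it needs, and if it is underconfident, it searches until the cap every time.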
The method itself started from a strong open-weight base rather than a model we pre-trained ourselves, since the gap we care about lives in post-training, not pre-training. The base supplies the general reasoning, and our data supplies the clinical judgment on top of it. From there, we ran three stages, all anchored to the same signal:
Supervised fine-tuning on teacher rollouts, filtered down to the answers clinicians preferred, so that we trained on the top of the distribution rather than its average. We began with a few thousand preference-filtered examples and grew the set as the signal held up.
On-policy self-distillation, to sharpen the model's own strongest behavior without dragging in off-policy noise.
Direct preference optimization (DPO), run directly on the side-by-side data, so we train on the exact signal we grade by.
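To show how side-by-side votes map onto the last stage, here is a minimal sketch in plain Python. It turns one SBS record into a (prompt, chosen, rejected) triple and computes the standard DPO loss for a single pair from sequence log-probabilities. `SbsRecord`, `to_preference_pair`, and the `beta` value are assumptions for illustration; this is the textbook objective, not a description of our training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SbsRecord:
    question: str
    answer_a: str
    answer_b: str
    preferred: str  # "a" or "b", as picked by the blinded clinician


def to_preference_pair(record: SbsRecord) -> tuple[str, str, str]:
    """Return (prompt, chosen, rejected) for one side-by-side vote."""
    if record.preferred == "a":
        return record.question, record.answer_a, record.answer_b
    if record.preferred == "b":
        return record.question, record.answer_b, record.answer_a
    raise ValueError("preferred must be 'a' or 'b'")


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value; controls how far the policy may move
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy assigns the same log-probabilities as the reference model, the margin is zero and the loss is `log(2)`. The loss drops as the policy raises the preferred answer's likelihood relative to the rejected one. The same chosen answers from `to_preference_pair` are what a preference-filtered SFT set would keep.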
The thread running through all three is that the metric we optimize and the metric we ship against are one and the same: clinician preference. Training and evaluation were built together rather than bolted on afterward, and that harness, with the data behind it, is where the lift came from, not the base weights.
The thing general-purpose labs can't buy
This is also why proprietary data changes the math. General models optimize against general rewards (helpfulness, harmlessness, the broad shape of a useful assistant), and those signals are everywhere, with every lab climbing the same hill on far more compute than we have. Clinical quality is a different objective, and the function that defines it simply isn't in general data. It lives in which answer a clinician prefers when the stakes are real. The things that make an answer good there (how it's formatted, how brief it is, how it weights its sources, whether it's clinically true) aren't in web data either.
The part people miss is that raw scale isn't the asset here. Curated preference is. The signal we train on isn't grunt work that a bigger model can pattern-match its way past. Every side-by-side resolved in Evidence is one more label on that function, and together they add up to a reward function built from clinical judgment rather than scraped text.
The loop
What keeps it improving is the loop. A good product earns clinician use, that use makes the model better, and a better model makes for a product clinicians trust and reach for more often. Each turn compounds value for the clinicians who use it, and it keeps going whether or not we touch it. The early shape is already visible, with Evidence now answering over 300,000 questions a week.
What owning the model layer gives us
Owning the model end-to-end, instead of renting someone else's, is what makes all of this possible, and safety is the first reason it matters. As Heidi moves closer to clinical care, the system behind it has to behave like a medical device, staying consistent and inside known error bars right down to the inference, and we can only stand behind that if the model is ours. Owning it also means we can host open weights wherever data residency and governance require, and audit how the model behaves when an answer matters.
Parity is only half of it. Our model matches the frontier model on the answers clinicians prefer and matches or beats it on clinical safety, and because it's smaller, it serves those answers faster. Clinicians get quicker answers, without trading away quality or safety. Running it efficiently also lets us keep out-of-session Evidence free for every clinician, not just for those who can afford it.
That's what owning the model really buys us. It democratizes access, so clinicians anywhere, and the patients they treat, get the same safe, trusted answers. Just another step on our mission to double the world’s healthcare capacity.