Fine-Tuning a Clinical AI Model to Frontier Parity

Why bigger isn't always better in clinical AI

You don't need frontier scale to reach frontier quality. You need a reward signal that's yours alone, and a tight loop to learn from it. Six weeks ago, we started replacing the best frontier model running in Heidi Evidence with a model of our own, a fraction of its size. On blind side-by-side evaluation, it has already reached parity, to the point where clinicians can no longer tell which is which.

This post is about how we got there, what the result does and doesn't cover, and why we think the pattern generalizes beyond our own use at Heidi.

The signal only clinicians can give

Evidence is Heidi's clinical search product, free to use outside of a patient session. A clinician asks a question and gets an answer grounded in real sources. Evidence has answered more than 3.5 million questions since launch. The value isn't in the volume of questions; it's that Evidence answers are backed by something the general-purpose labs can't buy: a real clinician telling us which of two responses was the better one. That preference is the signal we train on.

How we measure

We grade our models the way clinicians experience them, in a test we call SBS (side by side). Two answers to the same real Evidence question are shown side by side; the clinician does not know which model produced which answer and picks the one they prefer. A 50% win rate means parity: clinicians can't distinguish the two models' answers in quality. We run blinded SBS evaluations before each model update.
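To make the parity criterion concrete, here is a minimal sketch of how a win rate from blinded pairwise preferences could be computed and checked against 50%. It is an illustration under stated assumptions, not our evaluation pipeline: the function names, the choice of a Wilson score interval, and the counts in the usage note are all hypothetical, and it assumes every comparison ends in a clear preference (no ties).

```python
import math


def sbs_win_rate(wins: int, losses: int) -> float:
    """Fraction of blinded comparisons in which the candidate model was preferred."""
    total = wins + losses
    if total == 0:
        raise ValueError("need at least one comparison")
    return wins / total


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def consistent_with_parity(wins: int, losses: int, z: float = 1.96) -> bool:
    """True when a 50% win rate lies inside the interval.

    Assumption: this interval check stands in for whatever parity test
    is actually used; it is one reasonable choice among several.
    """
    low, high = wilson_interval(wins, wins + losses, z)
    return low <= 0.5 <= high
```

With hypothetical counts, `consistent_with_parity(100, 100)` holds because the interval is centred on 50%, while `consistent_with_parity(150, 50)` does not, since a 75% win rate over 200 comparisons sits well above parity. An interval is used rather than the raw rate because a win rate close to 50% on a small sample says little either way.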