Who judges the AI your doctor uses?

Under the hood of medical AI

TL;DR

Doctors use medical AI.

OpenAI uses HealthBench to judge medical AI.¹

HealthBench is written by doctors (with AI assist) and graded by AI.²

But who judges the judge?

Here is how we judged the judge. 👇

How OpenAI grades its own medical AI

OpenAI's main medical benchmark is HealthBench. The headline 2026 result — ChatGPT for Clinicians scored 59.0 vs a 43.7 physician baseline — is from this benchmark family.

A benchmark score depends on what questions were asked, how the answer key was written, how responses were graded, and whether the answer key itself is right.

Who writes the answer key

HealthBench was developed with 262 physicians. Its rubrics are physician-written.

But OpenAI's own paper reports that in one physician-response experiment, physicians were shown model responses and "encouraged to copy-paste and improve" them.

Grading is also done by AI: HealthBench uses an LLM-as-judge to score responses against the physician-written rubrics. So both sides of HealthBench, the physician reference responses and the grading, involve AI.
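
To make the grading step concrete, here is a minimal Python sketch of rubric-based scoring. It assumes each rubric criterion carries a point value and that a judge returns a met or not-met verdict per criterion. The `Criterion` type, the `keyword_judge` stand-in (a plain keyword match in place of a real LLM call), and the example criteria are hypothetical illustrations, not HealthBench's actual code or data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    text: str     # what the rubric author wants to see (or not see)
    points: int   # positive = desired, negative = penalized
    keyword: str  # hypothetical: lets the toy judge decide without an LLM


def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Toy stand-in for an LLM judge: 'met' if the keyword appears."""
    return criterion.keyword.lower() in response.lower()


def rubric_score(
    response: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool] = keyword_judge,
) -> float:
    """Points on met criteria over the maximum positive points, clipped to [0, 1]."""
    earned = sum(c.points for c in rubric if judge(response, c))
    possible = sum(c.points for c in rubric if c.points > 0)
    if possible == 0:
        return 0.0
    return max(0.0, min(1.0, earned / possible))


# Hypothetical rubric, invented for illustration only.
rubric = [
    Criterion("Advises seeking emergency care", 5, "emergency"),
    Criterion("Asks about symptom duration", 3, "how long"),
    Criterion("Recommends a specific drug dose unprompted", -4, "mg"),
]

score = rubric_score("Go to the emergency department now.", rubric)
print(score)  # 5 of 8 possible positive points
```

The point of the sketch: the score can only be as good as the rubric and the judge. Swap in a different judge or a different criterion and the same response gets a different number.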

"Doctor-approved" ≠ "independently verified."

Why it happens

Building benchmarks at scale is expensive. OpenAI and Anthropic contract huge data-labeling jobs to firms like Scale AI and Turing. Domain experts draft and review thousands of items as paid piecework. Real medical work. Fast. Uneven.

For a fuller picture, see The Medical AI Landscape.

What we found in the answer key

We audited both public HealthBench variants — the broader May-2025-paper dataset and the newer HealthBench Professional (the variant behind the 59.0 vs 43.7 marketing headline). 1,298 medical claims checked against the actual clinical literature. 29 decision-changing errors in the answer key.

One example: a rural-sepsis question about whether to start antibiotics before the patient reaches the hospital. HealthBench's answer cites the PHANTASi trial, a real randomized trial, and treats it as showing a mortality benefit. PHANTASi did not find a statistically significant mortality benefit in the overall trial population (28-day mortality 8% vs 8%, p=0.74).

When the answer key is wrong, a model that repeats the error scores "correct." A model faithful to the trial may be penalized.
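
The effect of a wrong answer key can be shown in a few lines of Python. This is a toy sketch, not the audit's code: it reduces the sepsis example to a single yes/no claim ("the trial showed a mortality benefit"), and the two model names are hypothetical.

```python
# What the answer key rewards vs. what the trial found (per the example above).
KEY_EXPECTS_BENEFIT = True    # the key treats the trial as showing a benefit
TRIAL_FOUND_BENEFIT = False   # no statistically significant mortality benefit

# Hypothetical models and the claim each one makes.
model_claims = {
    "repeats_key": True,         # says the trial showed a mortality benefit
    "faithful_to_trial": False,  # says it did not
}

results = {}
for name, claims_benefit in model_claims.items():
    results[name] = {
        "graded_correct": claims_benefit == KEY_EXPECTS_BENEFIT,
        "matches_trial": claims_benefit == TRIAL_FOUND_BENEFIT,
    }

for name, outcome in results.items():
    print(name, outcome)
```

In this toy setup the two columns are exact opposites: the grader rewards the model that disagrees with the trial.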

What this means for you

"Doctor-approved" ≠ "fact-checked."
"Cites a study" ≠ "the study supports the claim."
A benchmark score is not proof a system is reliable for your situation.

The right question: does the cited clinical evidence actually support this claim for someone like me?

See for yourself

📘 Full audit — github.com/borisdev/nobsmed-healthbench-audit (all 22 findings, every PMID and DOI)
🧪 Example report — 3 of the strongest findings, patient-facing
📝 Corrections welcome — open an issue on the repo

→ How does NoBSmed fit into this audit gap? See our mission.

¹ "HealthBench" here refers to OpenAI's medical benchmark family. OpenAI uses the broader public HealthBench to grade ChatGPT on general medical knowledge, and the newer clinician-focused variant, HealthBench Professional, to grade ChatGPT for Clinicians. The headline 2026 result (ChatGPT for Clinicians scored 59.0 vs a 43.7 physician baseline) is from HealthBench Professional specifically. Our audit covers both variants: 1,200 cited + high-stakes claims from the public dataset and 98 cited + high-stakes claims from Professional, 1,298 in total. For simplicity, the rest of this post uses "HealthBench" to mean either or both.

² Two AI involvements in HealthBench: (a) physicians drafting reference responses were sometimes shown OpenAI model outputs and could produce their answer "by copying and modifying parts of the existing responses or writing new responses altogether" (see OpenAI's HealthBench paper); (b) grading is done by an LLM-as-judge that scores responses against the physician-written rubrics. See also our Notes on terminology and authorship in the OSS audit report.