Who judges the AI your doctor uses?

Under the hood of medical AI

TL;DR

  1. Doctors use medical AI.
  2. Who judges medical AI? To judge its medical AI, OpenAI uses HealthBench. [1]
  3. With HealthBench, human doctors use AI to judge AI. [2]

But who judges the judge?

Here is how we judged the judge. 👇

How OpenAI grades its own medical AI

OpenAI's main medical benchmark is HealthBench. The headline 2026 result — ChatGPT for Clinicians scored 59.0 vs a 43.7 physician baseline — is from this benchmark family.

A benchmark score depends on what questions were asked, how the answer key was written, how responses were graded, and whether the answer key itself is right.

Who writes the answer key

HealthBench was developed with 262 physicians. Its rubrics are physician-written.

But OpenAI's own paper reports that in one physician-response experiment, physicians were shown model responses and "encouraged to copy-paste and improve" them.

Grading is also done by AI: HealthBench uses an LLM-as-judge to score responses against the physician-written rubrics. So both halves of that experiment, the drafting of physician reference responses and the grading, involve AI.

"Doctor-approved" ≠ "independently verified."

Why it happens

Building benchmarks at scale is expensive. OpenAI and Anthropic contract huge data-labeling jobs to firms like Scale AI and Turing. Domain experts draft and review thousands of items as paid piecework. Real medical work. Fast. Uneven.

For a fuller picture, see The Medical AI Landscape.

What we found in the answer key

We audited all 1,200 medical claims in a public slice of HealthBench against the actual clinical literature and found 22 decision-changing errors in the answer key. (Our audit covers the broader public HealthBench. HealthBench Professional is the next pass.)
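
For a sense of scale, the two figures above can be turned into a rate. This is a back-of-the-envelope sketch, assuming each of the 22 errors maps to exactly one of the 1,200 audited claims (our simplifying assumption; the variable names are illustrative):

```python
# Figures quoted in the audit summary above.
claims_audited = 1200
decision_changing_errors = 22

# Assumption: one error per claim, so the ratio reads as a per-claim error rate.
error_rate = decision_changing_errors / claims_audited
one_in_n = round(claims_audited / decision_changing_errors)

print(f"{error_rate:.1%}")   # 1.8%
print(one_in_n)              # 55, i.e. roughly one audited claim in 55
```

A rate under 2% sounds small, but these are the errors that would change a clinical decision, and each one sits in the key that every model is graded against.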

One example: a rural-sepsis question about whether to start antibiotics before the hospital. HealthBench's answer cites the PHANTASi trial — a real randomized trial — and treats it as showing a mortality benefit. PHANTASi did not find a statistically significant mortality benefit in the overall trial population (28-day mortality 8% vs 8%, p=0.74).

When the answer key is wrong, a model that repeats the error scores "correct." A model faithful to the trial may be penalized.
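
To make that concrete, here is a toy sketch of rubric-based grading. It is not HealthBench's grader: the two rubric criteria, the point values, and the two sample responses are hypothetical, the scoring rule (points earned over points possible) is our simplification, and the "judge" is a plain keyword check standing in for an LLM.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str      # what the rubric says a good answer contains
    keyword: str   # hypothetical stand-in for the judge's decision
    points: int


def judge_meets(response: str, criterion: Criterion) -> bool:
    # Stand-in for an LLM judge: a keyword check, for illustration only.
    return criterion.keyword.lower() in response.lower()


def score(response: str, rubric: list[Criterion]) -> float:
    # Points earned over positive points possible, clipped to [0, 1].
    earned = sum(c.points for c in rubric if judge_meets(response, c))
    possible = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, earned / possible))


# Hypothetical rubric whose first criterion encodes the misread trial result.
flawed_rubric = [
    Criterion("States that the trial showed a mortality benefit",
              "reduced mortality", 5),
    Criterion("Recommends coordinating with the receiving hospital",
              "receiving hospital", 5),
]

repeats_error = ("Prehospital antibiotics reduced mortality; "
                 "coordinate with the receiving hospital.")
faithful = ("The trial found no significant mortality difference; "
            "coordinate with the receiving hospital.")

print(score(repeats_error, flawed_rubric))  # 1.0
print(score(faithful, flawed_rubric))       # 0.5
```

The response that repeats the error earns full marks, while the one that reports the trial accurately loses half its points. Nothing in the grader is broken; it is faithfully applying a key that is wrong.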

What this means for you

The right question: does the cited clinical evidence actually support this claim for someone like me?

See for yourself

→ How does NoBSmed fit into this audit gap? See our mission.


[1] "HealthBench" here refers to OpenAI's medical benchmark family. OpenAI uses the broader public HealthBench to grade ChatGPT on general medical knowledge, and the newer clinician-focused variant, HealthBench Professional, to grade ChatGPT for Clinicians. The headline 2026 result (ChatGPT for Clinicians scored 59.0 vs a 43.7 physician baseline) is from HealthBench Professional specifically. For simplicity, the rest of this post uses "HealthBench" to mean either or both variants. Our audit so far covers the broader public HealthBench; HealthBench Professional is the next pass.

[2] Two AI involvements in HealthBench: (a) physicians drafting reference responses were sometimes shown OpenAI model outputs and could produce their answer "by copying and modifying parts of the existing responses or writing new responses altogether" (see OpenAI's HealthBench paper); (b) grading is done by an LLM-as-judge that scores responses against the physician-written rubrics. See also our Notes on terminology and authorship in the OSS audit report.
