How does NoBSmed score on HealthBench?

Different task. Here's what we mean.

TL;DR

HealthBench grades model outputs — the advice itself.

NoBSmed audits the evidence base — the studies the advice cites.

Different tasks. Scoring NoBSmed on HealthBench would mean cheating at a problem we don't claim to solve.

After we published our HealthBench audit, the most-asked question was a fair one:

"OK, so how does your tool score on HealthBench?"

The honest answer is: we'd score zero. And that's the point.

What HealthBench grades

HealthBench gives a model a clinical prompt (a patient question, a clinician scenario) and grades the model's answer against a physician-written rubric. The model has to generate the advice — diagnose, recommend a drug, sketch a treatment plan, cite supporting evidence.

That's a generation task. It assumes the model can produce clinically grounded advice.

What NoBSmed does

NoBSmed isn't an advice generator. It's an evidence checker. You paste a medical claim plus your personal context, and we check whether the cited clinical evidence actually supports the claim — for someone like you.

That's an audit task. We don't produce a recommendation; we produce a list of questions worth asking your doctor, cited to the actual studies.

HealthBench is a benchmark and NoBSmed is a tool, and the two do different things:

	HealthBench grades	NoBSmed audits
Input	A clinical question	A medical claim + your context
Output	An advice answer	An applicability check
Judged on	Does the answer match physician-written rubrics?	Does the cited evidence actually support the claim for this patient?
Goal	Better advice generation	Better evidence scrutiny

Why scoring on HealthBench would be cheating at the wrong task

If you put NoBSmed in front of a HealthBench prompt, NoBSmed wouldn't say "take probiotics" or "don't start aspirin." It would say "the Blaabjerg 2017 meta-analysis the answer cites was an outpatient study with no elderly data — and a separate RCT (PLACIDE) tested probiotics in elderly inpatients and found no benefit. Ask your doctor whether the trial behind this recommendation was run in patients your age."

A HealthBench grader looking at that would mark it incomplete: there's no clinical advice and no yes/no answer. That's why we'd score zero. We're not playing that game.

This is the same reason a Carfax report wouldn't pass a "Best Car of the Year" review. Carfax checks the history. The "best car" review grades the car. Different judgments, different tools.

We never claimed a causal model

NoBSmed has evolved step by step upstream, closer to the primary studies, not toward causal reasoning:

Baked-in prediction — generic LLM answers from training data
Paraphrased abstracts — better, still summary-level
Full studies — retrieve the actual papers, not just abstracts
Granular parsing of full studies — extract structured population, intervention, effect-size details so we can match advice to your situation
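
To make the last step concrete, here is a minimal sketch of what matching a parsed study to a reader's situation could look like. It is an illustration, not NoBSmed's actual code: the `StudyRecord` and `PatientContext` schemas, their field names, and the `applicability_questions` helper are all assumptions made up for this example. The study fields reflect only how the Blaabjerg 2017 meta-analysis is described in this post, and the patient is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StudyRecord:
    """Fields extracted from one full-text study (hypothetical schema)."""
    label: str
    setting: str            # "outpatient" or "inpatient"
    includes_elderly: bool  # whether the study reports data on elderly patients


@dataclass(frozen=True)
class PatientContext:
    """Personal context supplied with the claim (hypothetical schema)."""
    is_elderly: bool
    setting: str


def applicability_questions(study: StudyRecord, patient: PatientContext) -> List[str]:
    """Compare the study population to the patient; return questions, never advice."""
    questions = []
    if patient.is_elderly and not study.includes_elderly:
        questions.append(
            f"{study.label} has no elderly data. "
            "Was the trial behind this recommendation run in patients my age?"
        )
    if patient.setting != study.setting:
        questions.append(
            f"{study.label} was run in an {study.setting} setting. "
            f"Does it apply to {patient.setting} care?"
        )
    return questions


# Study fields as described in this post; the patient is invented for illustration.
blaabjerg = StudyRecord("Blaabjerg 2017", setting="outpatient", includes_elderly=False)
patient = PatientContext(is_elderly=True, setting="outpatient")

for question in applicability_questions(blaabjerg, patient):
    print(question)
```

Note what the function returns: a list of questions for your doctor, with no verdict on whether to take the treatment. Population mismatches are flagged and nothing is inferred about mechanism.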

None of those layers is a causal model. Causal reasoning — "this drug should work because mechanism X interacts with biology Y in patients with condition Z" — is a different problem. Some clinical AI products attempt it. NoBSmed deliberately doesn't pretend to.

Where causal reasoning could come from

Worth noting: the mechanisms section of clinical-trial papers does encode causal hypotheses. An LLM that read mechanisms carefully could, in principle, reason about why a drug works (or doesn't) in a specific population. That's a research direction, not a shipped product. If anyone's doing it well, we'd want to know.

For now, NoBSmed stays in evidence-audit territory: what does the published literature actually show, and how does that map onto your situation? That's a smaller claim than "I can give you medical advice." It's also a more verifiable one.

So why are we publishing the audit?

Two reasons, and neither is to show off NoBSmed.

To warn the public: clinician-facing tools like ChatGPT for Clinicians are being graded on a benchmark whose answer key has fabricated citations, inverted study results, and rubrics that contradict themselves. Patients should know.

To help the medical-AI developer community: if the benchmark we're all optimizing against has structural flaws, fixing them benefits everyone working in this space. Every finding and every source is reproducible. Corrections welcome.

Want the receipts? Read the full audit: 29 decision-changing findings across both public HealthBench and HealthBench Professional, with every PMID and DOI, reproducible end to end.

Or take the patient-facing tour: Under the hood of medical AI.