A major new study from Harvard Medical School and Beth Israel Deaconess Medical Center has found that a large language model (LLM) outperformed physicians across a wide range of clinical reasoning tasks, including making emergency-room triage decisions from messy, real-world patient data.
The findings, published April 30 in Science, represent one of the largest comparisons yet between AI and physicians on clinical tasks. And the researchers say the results are significant enough to warrant the next step: rigorous, prospective clinical trials in real care settings.
Prior AI benchmarks have often relied on multiple-choice tests or cleaned-up patient data. This study deliberately didn't. The team fed the model raw electronic health records from real emergency department cases: messy, incomplete, and exactly as clinicians encounter them in practice.
The model was evaluated at multiple stages of a standard ER visit, from early triage (when very little data is available) all the way through to admission decisions. At each stage, it was only given the information that would actually be on hand at that moment.
At the earliest decision points, where clinical data is thinnest and may even lack full context, the model matched or exceeded attending physicians in diagnostic accuracy. That surprised even the researchers.
The study also highlights a growing problem in medical AI evaluation: models are now consistently scoring near 100% on traditional multiple-choice tests, making it impossible to track meaningful progress. The field needs harder, more realistic tests that mimic real patient encounters, exactly what this study attempted to provide.
The researchers are clear that strong performance on clinical reasoning tasks is not the same as being ready to practice medicine autonomously.
The team's position is that medical AI has reached a threshold where it should be studied the same way all new medical interventions are: through controlled clinical trials in real care settings. The question is no longer whether AI can reason about medicine; it's whether, how, and where it should be used as a tool alongside physicians.
We are at a genuine inflection point in American healthcare. The US faces a deepening physician shortage, and nowhere is that more acute than in oncology. With roughly 25,000–28,000 active oncologists in the country, and one in three Americans expected to receive a cancer diagnosis in their lifetime, the math simply doesn’t work. Layer on top of that the burnout crisis pushing nurses and care staff out of the profession, and the system’s capacity problem becomes impossible to ignore.
This is why studies like this one matter beyond academia. The question was never really "can AI beat a doctor?" It's "can AI help an overstretched system see more patients, catch more cases earlier, and reduce the burden on specialists for every decision?"
The answer increasingly looks like yes, but only with the right guardrails in place. HIPAA compliance, liability frameworks, and model explainability (clinicians need to understand why the AI flagged something, not just that it did) are all real and unsolved hurdles. A human-in-the-loop model, where AI surfaces insights and humans act on them, is a practical and responsible starting point.
Perhaps most importantly, AI assistance could extend the effective reach of non-specialists. A general practitioner or nurse practitioner supported by a strong clinical AI can move patients further along their care journey without defaulting to a specialist for every ambiguous case. In a system already strained at the specialist level, that's not a minor efficiency gain; it's a structural shift in how care can be delivered.