
AI matched or beat physicians on real-world clinical reasoning

A major new study from Harvard Medical School and Beth Israel Deaconess Medical Center has found that a large language model (LLM) outperformed physicians across a wide range of clinical reasoning tasks, including making emergency-room triage decisions from messy, real-world patient data.

The findings, published April 30 in Science, represent one of the largest comparisons yet between AI and physicians on clinical tasks. And the researchers say the results are significant enough to warrant the next step: rigorous, prospective clinical trials in real care settings.

  • Hundreds of clinicians compared against the AI model
  • 3 task types: diagnostic challenges, reasoning exercises, and real ER cases
  • Zero pre-processing applied to real-world ER data before testing

What made this study different

Prior AI benchmarks have often relied on multiple-choice tests or cleaned-up patient data. This study deliberately didn’t. The team fed the model raw electronic health records from real emergency department cases: messy, incomplete, and exactly as clinicians would encounter them in real life.

The model was evaluated at multiple stages of a standard ER visit, from early triage (when very little data is available) all the way through to admission decisions. At each stage, it was only given the information that would actually be on hand at that moment.

“We didn’t pre-process the data at all.”
— Adam Rodman, HMS assistant professor of medicine at Beth Israel Deaconess, and co-senior author

At the earliest decision points, where clinical data is thinnest and may even lack full context, the model matched or exceeded attending physicians in diagnostic accuracy. That surprised even the researchers.

Why current benchmarks are becoming obsolete

The study also highlights a growing problem in medical AI evaluation: models now consistently score near 100% on traditional multiple-choice tests, making it impossible to track meaningful progress. The field needs harder, more realistic tests that mimic real patient encounters; that is exactly what this study set out to provide.

The benchmark ceiling problem
“Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent, and we can’t track progress anymore because we’re already at the ceiling.” — Peter Brodeur, co-first author

What this doesn’t mean

The researchers are clear that strong performance on clinical reasoning tasks is not the same as being ready to practice medicine autonomously.

What the researchers want you to know
  • A model may identify the top diagnosis correctly while also recommending unnecessary tests that could harm patients
  • Humans must remain the ultimate baseline when it comes to evaluating performance and safety
  • The next step is prospective clinical trials, not deployment

What comes next

The team’s position is that medical AI has reached a threshold where it should be studied the same way all new medical interventions are: through controlled clinical trials in real care settings. The question is no longer whether AI can reason about medicine; it’s whether, how, and where it should be used as a tool alongside physicians.

Author’s take: the right tool for the right moment

We are at a genuine inflection point in American healthcare. The US faces a deepening physician shortage, and nowhere is that more acute than in oncology. With roughly 25,000–28,000 active oncologists in the country, and one in three Americans expected to receive a cancer diagnosis in their lifetime, the math simply doesn’t work. Layer on top of that the burnout crisis pushing nurses and care staff out of the profession, and the system’s capacity problem becomes impossible to ignore.
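To see why the math doesn’t work, here is a rough back-of-envelope sketch. The population figure (~335 million) is an assumption for illustration; the 1-in-3 lifetime risk and ~27,400 active oncologists come from the SEER and Medicus figures cited in the references below.

```python
# Back-of-envelope: expected lifetime cancer diagnoses per active US oncologist.
# Assumed: ~335M US population (illustrative), ~1-in-3 lifetime cancer risk,
# ~27,400 active oncologists.
US_POPULATION = 335_000_000
LIFETIME_CANCER_RISK = 1 / 3
ACTIVE_ONCOLOGISTS = 27_400

# Americans expected to receive a cancer diagnosis at some point in their lives
expected_lifetime_diagnoses = US_POPULATION * LIFETIME_CANCER_RISK

# Crude ratio of those future patients to today's oncologist workforce
patients_per_oncologist = expected_lifetime_diagnoses / ACTIVE_ONCOLOGISTS

print(f"{expected_lifetime_diagnoses:,.0f} expected lifetime diagnoses")
print(f"≈ {patients_per_oncologist:,.0f} future patients per active oncologist")
```

Even as a crude ratio that ignores diagnoses spread over decades and care shared across teams, roughly four thousand future patients per oncologist makes the capacity gap concrete.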

This is why studies like this one matter beyond the academic. The question was never really “can AI beat a doctor?” It’s “can AI help an overstretched system see more patients, catch more cases earlier, and reduce the burden on specialists for every decision?”

The answer, increasingly, looks like yes, but only with the right guardrails in place. HIPAA compliance, liability frameworks, and model explainability (clinicians need to understand why the AI flagged something, not just that it did) are all real and unsolved hurdles. But a human-in-the-loop model, where AI surfaces insights and humans act on them, is a practical and responsible starting point.

Perhaps most importantly, AI assistance could extend the effective reach of non-specialists. A general practitioner or nurse practitioner supported by a strong clinical AI can move patients further along their care journey without defaulting to a specialist for every ambiguous case. In a system already strained at the specialist level, that’s not a minor efficiency gain; it’s a structural shift in how care can be delivered.

References

  1. Brodeur P, et al. “Performance of a large language model on the reasoning tasks of a physician.”
    Science, April 30, 2026.
    doi:10.1126/science.adz4433
  2. Harvard Medical School / Beth Israel Deaconess Medical Center press release via EurekAlert, April 30, 2026.
    eurekalert.org
  3. National Cancer Institute SEER Program. “Cancer Stat Facts: Cancer of Any Site.”
    Lifetime risk data based on 2021–2023 data.
    seer.cancer.gov
  4. Medicus Healthcare Solutions. “Examining the Oncologist Shortage.” January 2025.
    Approximately 27,400 oncologists active in the US; projected shortage of 2,200+ by 2025.
    medicushcs.com
  5. American Society of Clinical Oncology (ASCO). “2025 State of the Hematology and Medical Oncologist Workforce in America.”
    JCO Oncology Practice, October 2025.
    asco.org
  6. Healio. “US oncologist shortage ‘severe,’ projected to grow by year’s end.” March 2025.
    healio.com
Ritika Bramhe

Ritika Bramhe is Head of Marketing and Product Marketing Manager at OnPage Corporation, where she wears many hats across positioning, messaging, analyst relations, and growth strategy. She writes about incident alerting, on-call management, and clinical communication, bringing a marketer’s perspective shaped by years of experience working at the intersection of IT, healthcare, and SaaS. Ritika is passionate about translating complex topics into clear, actionable insights for readers navigating today’s digital communication challenges.
