For most of medical history, diagnosis was the one clinical task that seemed genuinely hard to automate. Pattern recognition was only half the job. A doctor also had to weigh a patient's full history, account for the things they weren't saying, and navigate the messy ambiguity of overlapping symptoms. That argument for human primacy is getting harder to make.
A report published in Science magazine surveys the recent research and finds a consistent pattern: in controlled studies, AI systems are outperforming doctors on diagnostic accuracy across a widening range of conditions. The systems in question are large language models (LLMs) - the same class of technology that powers ChatGPT and Claude, trained on enormous amounts of text, including medical literature and case records.
What the Studies Actually Show
The strongest results appear in rare disease identification and complex multi-symptom cases - exactly the scenarios where individual physician experience becomes a bottleneck. A doctor can only have personally encountered so many presentations of a rare condition. An AI trained on millions of case records doesn't carry that limitation.
Researchers also point to cognitive bias as a factor. Clinicians, like all people, tend to anchor on their first read of a case and can struggle to revise that judgment when new information arrives. LLMs process all available data simultaneously, without the mental fatigue that affects human reasoning on a busy shift. In controlled head-to-head evaluations, that consistency translates to measurably fewer missed diagnoses.
The benchmark that gets cited most often is the USMLE - the United States Medical Licensing Examination, the three-step exam that most physicians must pass to be licensed in the US. GPT-4 cleared the passing threshold, and more recent models have pushed well above it. Passing a licensing exam is not the same as practicing medicine, but it does demonstrate genuine clinical knowledge, not just keyword matching.
Where the Limits Are
None of this means a chatbot should be triaging your emergency department. The studies are conducted in controlled environments with structured data inputs, which is very different from a real clinic visit with incomplete information, an anxious patient, and time pressure. AI systems also cannot perform a physical examination or pick up on the non-verbal signals an experienced clinician notices in the room.
There's also a distribution problem. These models perform well on average across large sample sizes. For any individual case, there's no reliable way to know whether you're getting one of the correct diagnoses or one of the errors - and in medicine, errors carry consequences that don't average out.
The more grounded read of this research is that AI is ready to be a serious diagnostic decision-support tool, not a replacement. A physician reviewing an AI's differential diagnosis before finalizing their own is a different proposition from an AI replacing the physician entirely. The studies suggest that this combination - human clinician plus AI second opinion - likely outperforms either alone.
For people already using AI tools in their work, the healthcare findings are a useful data point: these systems are capable of rigorous analytical reasoning in high-stakes domains, not just drafting emails. Medicine is only beginning to work out how to integrate that capability into professional workflows without losing the human accountability layer. The same question will land in most knowledge-work fields eventually.