The Emperor’s New Clothes
Are chatbots making clinicians lazy, or convincing us to place too much confidence in their diagnostic prowess?

By Paul Cerrato, MA, senior research analyst and communications specialist, and John Halamka, M.D., Diercks President, Mayo Clinic Platform
You may recall the fairy tale by Hans Christian Andersen about a vain emperor who parades through the streets with no clothes, until a guileless child shouts out the obvious: the emperor is naked! If one takes a critical look at ChatGPT and similar large language models, one can't help but wonder if these digital tools are also rather "threadbare," especially when they are used in the healthcare ecosystem.
On the one hand, there is evidence to support the use of generative AI and other artificial neural networks to augment radiologists' diagnostic skills. Similarly, a recent report from Microsoft suggests that focusing LLMs' attention on sequential diagnosis may improve their usefulness. In the past, LLMs impressed users with their ability to pass medical licensing exams such as the United States Medical Licensing Examination (USMLE), but critics pointed out that high scores on the USMLE do not ensure a model's ability to make accurate clinical decisions in the real world. Subsequent studies found that combining several commercially available LLMs improved diagnostic accuracy when they were tested on 200 real-world clinical cases from the Human Diagnosis Project platform. But once again, critics pointed out that experiments like these don't really capture the diagnostic process that most clinicians work through. Microsoft, in contrast, developed a system that "transformed these static narratives into interactive, stepwise diagnostic tasks," which the developers believe brings us closer to the actual steps that physicians take.
Microsoft created the Sequential Diagnosis Benchmark (SDBench), derived from over 300 complex diagnostic puzzles written up in the New England Journal of Medicine. Unlike the static write-ups that have been used in the past to test the value of LLMs, SDBench generates an interactive set of stepwise diagnostic tasks. With the help of several AI agents, the system withholds the results of lab tests and imaging until they are called for. The LLM must decide what history is relevant, what tests to order, and when to make a final diagnosis. The research team also created the MAI Diagnostic Orchestrator (MAI-DxO), which "orchestrates a panel of role-playing agents, each with a specific function: maintaining the evolving differential diagnosis, selecting high-value next tests, challenging premature closure, tracking costs, and ensuring logical consistency. A 'chain of debate' among these agents drives the decision to either seek more information or commit to a final diagnosis." The results were impressive: when paired with OpenAI's o3 model, MAI-DxO achieved 80% diagnostic accuracy, compared with 20% for physicians on average. It also reduced diagnostic costs by 20% relative to physicians and by 70% compared with o3 alone.
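To make the idea of an interactive, stepwise benchmark more concrete, here is a minimal sketch of the general pattern: a gatekeeper that withholds findings until they are ordered, and a small panel of stand-in "agents" that maintain a differential, pick the next test, track cost, and decide when to commit. This is a hypothetical illustration in Python, not Microsoft's SDBench or MAI-DxO code; the case data, agent roles, and decision rules are invented placeholders, and in the real system each role would be played by an LLM.

```python
# Hypothetical sketch of the sequential-diagnosis pattern described above.
# Not Microsoft's code: the case, costs, and rules are invented placeholders.

from dataclasses import dataclass, field

@dataclass
class Gatekeeper:
    """Holds the full case and reveals findings only when they are requested."""
    presenting_complaint: str
    hidden_findings: dict   # test name -> result, withheld until ordered
    test_costs: dict        # test name -> cost in dollars

    def order_test(self, test: str):
        return self.hidden_findings.get(test, "not available"), self.test_costs.get(test, 0)

@dataclass
class DiagnosticPanel:
    """Toy stand-ins for the role-playing agents: one maintains hypotheses,
    one selects the next test, one challenges premature closure, one tracks cost."""
    differential: list = field(default_factory=list)
    spent: float = 0.0
    budget: float = 1500.0

    def update_differential(self, finding: str):
        # Placeholder "hypothesis" agent: a real system would reason over findings.
        self.differential.append(f"hypothesis consistent with: {finding}")

    def choose_next_test(self, remaining_tests: list):
        # Placeholder "test selection" agent: take the cheapest unexplored test.
        return remaining_tests[0] if remaining_tests else None

    def ready_to_commit(self) -> bool:
        # Placeholder "premature closure" check: require at least two findings,
        # and stop if the cost tracker says the budget is exhausted.
        return len(self.differential) >= 2 or self.spent >= self.budget

def run_sequential_diagnosis(gatekeeper: Gatekeeper) -> str:
    panel = DiagnosticPanel()
    remaining = sorted(gatekeeper.test_costs, key=gatekeeper.test_costs.get)
    panel.update_differential(gatekeeper.presenting_complaint)

    while not panel.ready_to_commit():
        test = panel.choose_next_test(remaining)
        if test is None:
            break
        remaining.remove(test)
        result, cost = gatekeeper.order_test(test)  # results stay hidden until ordered
        panel.spent += cost
        panel.update_differential(f"{test}: {result}")

    return f"Final impression after ${panel.spent:.0f} in testing: {panel.differential[-1]}"

if __name__ == "__main__":
    case = Gatekeeper(
        presenting_complaint="fever and joint pain for two weeks",
        hidden_findings={"CBC": "mild anemia", "blood culture": "negative"},
        test_costs={"CBC": 25.0, "blood culture": 120.0},
    )
    print(run_sequential_diagnosis(case))
```

Even this toy version makes the key design choice visible: the model is judged not only on the final answer but on which tests it chooses to order and what each step costs.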
When reviewing these results, several questions come to mind. It's hard to believe that only 1 in 5 physicians would make an accurate diagnosis, but the esoteric disorders written up in the NEJM case reports might explain the low figure. More importantly, one has to question just how realistic the experiment was. If you read through any of the NEJM clinicopathological conferences, it's plain to see how orderly and detailed the data are. A more typical patient record will likely contain missing units, scanned PDF notes that probably can't be read by AI-based algorithms, and contradictory readings from more than one clinician. And of course, most EHRs contain cryptic narrative notes with shorthand comments like eGFR < 30. To a trained physician, that's a hard stop: a warning not to administer drugs that can damage the patient's kidneys. (eGFR refers to estimated glomerular filtration rate; when that rate drops too low, the kidneys can no longer handle the metabolic byproducts of drugs that are excreted in the urine.) As Myoung Cha with Verily points out, an LLM is likely to see eGFR < 30 as opaque jargon and not send out a specific alert.
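To illustrate the contrast, here is a hypothetical sketch of how a conventional rule-based safety check handles that shorthand: a deterministic rule either fires or it doesn't, whereas an LLM reading the same free text may or may not surface the value. The regex, threshold, and drug list below are invented for illustration, not clinical guidance or any vendor's actual decision-support logic.

```python
# Hypothetical illustration only: a deterministic "hard stop" rule for eGFR < 30.
# The pattern, threshold, and drug list are placeholders, not clinical guidance.

import re

# Example drugs often reconsidered in severe renal impairment (illustrative list).
RENALLY_RISKY_DRUGS = {"metformin", "nitrofurantoin"}

EGFR_PATTERN = re.compile(r"eGFR\s*(?:<|less than)\s*(\d+)", re.IGNORECASE)

def renal_hard_stop(note_text: str, proposed_drug: str) -> bool:
    """Return True if the note reports an eGFR bound at or below 30
    and the proposed drug is on the illustrative risk list."""
    match = EGFR_PATTERN.search(note_text)
    if match and int(match.group(1)) <= 30:
        return proposed_drug.lower() in RENALLY_RISKY_DRUGS
    return False

note = "72 y/o with DM2, HTN. Labs today: eGFR < 30, K 4.8. Plan: adjust meds."
print(renal_hard_stop(note, "metformin"))   # True  -> fire an alert
print(renal_hard_stop(note, "lisinopril"))  # False -> no alert from this rule
```

The point is not that regexes beat LLMs, but that the shorthand carries a precise, enforceable meaning that a generic language model has no guarantee of acting on.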
LLMs face other limitations as well, not the least of which is the tendency to discourage critical thinking among users who rely too heavily on them. For instance, Michael Gerlich, with the Center for Strategic Corporate Foresight and Sustainability at SBS Swiss Business School, studies cognitive offloading, which he refers to as the "externalisation of cognitive processes, often involving tools or external agents, such as notes, calculators, or digital tools like AI, to reduce cognitive load." With that concept in mind, Gerlich surveyed over 600 participants to see if there was any correlation between the use of AI tools and critical thinking skills. He found lower critical thinking skills among subjects between the ages of 17 and 25. It seems overreliance on chatbots makes a person mentally lazy, much as relying too heavily on a calculator lets you forget the multiplication tables you learned in elementary school.
It's impossible to turn back the clock. Large language models are here to stay, and, when used cautiously, they can lighten a clinician's workload. But let's not forget the emperor's mistake: it's easy to imagine that these digital tools have magical powers that don't exist.