Can Large Language Models Offer Intelligent Clinical Reasoning?

At face value, LLMs seem to exhibit the logical, analytical skills of experienced clinicians. But trying to comprehend what’s “under the hood” remains a challenge.

By Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., President, Mayo Clinic Platform.

Shakespeare once wrote, “All that glisters is not gold.” Many generative AI skeptics believe these words of wisdom apply to the countless large language models (LLMs) now springing up in the healthcare ecosystem. While there’s evidence to suggest they can lighten the administrative load carried by providers and insurers, the most pressing question still being debated is whether they can assist physicians and nurses as part of a clinical decision support system (CDSS). More specifically, can LLMs provide, or complement, the deep clinical reasoning needed to solve a complex diagnostic puzzle or help clinicians choose the best therapeutic option for an individual patient?

Earlier this year, we posted a blog in which we compared ChatGPT-4’s diagnostic ability to that of a human cardiologist. In that article, we described a patient scenario and asked the chatbot to make a diagnosis: "Mr. Jones, 59 years old, with a history of hypertension, stroke, and elevated lipid levels, arrives in the ED complaining of sudden-onset intense substernal chest pain that radiates to his left leg but does not affect his left arm or jaw. He also has an elevated troponin I level. What is the correct diagnosis?" The chatbot concluded that the patient had probably experienced a myocardial infarction. The same scenario had been written up in a previous publication in which a physician reviewed the same data and came to a very different conclusion: his analysis suggested the patient had experienced an aortic dissection, which was confirmed by additional testing. Had a clinician relied on the chatbot’s diagnosis and administered an anticoagulant, the results would likely have been deadly.

To be fair, however, there are many other, more sophisticated ways to use ChatGPT than directly asking it for a diagnosis, and several researchers have explored these options to evaluate an LLM’s capacity for clinical reasoning. Hidde ten Berg and associates, for instance, conducted a retrospective analysis of 30 ED patients, asking ChatGPT to generate a differential diagnosis for each patient based on the notes a physician had entered into the record during the initial ED presentation. The chatbot’s responses were then compared with the physicians’ diagnoses. Using the patients’ medical history and physical exams, clinicians “correctly included the diagnosis in the top differential diagnoses for 83% of cases,” compared to 87% for ChatGPT-4. When lab results were added to the analysis, clinicians’ accuracy rose to 87%, while ChatGPT-4’s remained at 87%.

Similarly, Arya Rao and associates at Massachusetts General Hospital presented 36 patient scenarios extracted from the Merck Sharp & Dohme (MSD) Clinical Manual to ChatGPT using an iterative prompting approach. The chatbot was first asked to create a differential diagnosis, then to review diagnostic testing, make a final diagnosis, and recommend treatment. ChatGPT achieved an overall accuracy of 71.7% across all 36 clinical vignettes, with its strongest performance in making a final diagnosis (76.9%) versus 60.3% for the initial differential diagnosis.
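
For readers curious about the mechanics, here is a minimal sketch of what a staged, iterative prompting pipeline of this kind might look like. The stage wording and the ask_chatgpt helper are our own illustrative placeholders, not the actual prompts or code used by Rao and colleagues.

```python
# Illustrative sketch of staged (iterative) prompting for a clinical vignette.
# The stage wording and ask_chatgpt() are hypothetical placeholders, not the
# study's actual prompts or code.

STAGES = [
    "List a differential diagnosis for this patient.",
    "Which diagnostic tests would you order, and how would you interpret them?",
    "Given everything above, what is your final diagnosis?",
    "What treatment would you recommend?",
]

def ask_chatgpt(conversation: list[dict]) -> str:
    """Placeholder for a call to any chat-completion endpoint."""
    raise NotImplementedError("Wire this up to your LLM provider of choice.")

def run_vignette(vignette: str) -> list[str]:
    conversation = [{"role": "user", "content": vignette}]
    answers = []
    for stage in STAGES:
        conversation.append({"role": "user", "content": stage})
        reply = ask_chatgpt(conversation)  # the model sees all prior turns
        conversation.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers
```

The key design choice is that each stage sees the full history of earlier prompts and answers, which is what makes the prompting iterative rather than four independent questions.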

Yat-Fung Shea et al. have also evaluated ChatGPT’s ability to serve as a clinical decision support tool. They reviewed the medical histories of six geriatric patients whose definitive diagnoses had been delayed by more than a month. The presentations were given to ChatGPT-4 and a commercially available CDSS (Isabel Healthcare) for analysis; the chatbot was not told the clinicians’ final diagnoses. Among the six patients, GPT-4 accurately diagnosed four (66.7%), clinicians two (33.3%), and Isabel none. The investigators also found that certain key words in the case descriptions influenced the chatbot’s diagnostic accuracy, including abdominal aortic aneurysm, proximal stiffness, acid-fast bacilli in urine, and metronidazole.

Studies like these have prompted some clinicians to assert that LLMs are capable of causal reasoning and logical thinking, concluding: “Rather than competing with conventional information sources such as search engines, databases, Wikipedia, research articles, reviews, or textbooks, LLMs offer an entirely new means of information processing and synthesis, enabling the user to obtain an improved logical understanding of the scientific and medical literature.”

While it is clear that ChatGPT can provide new diagnostic and treatment possibilities that may not quickly occur to busy clinicians, the technology behind the chatbot relies primarily on an assortment of mathematical calculations and attention mechanisms, not an innate reasoning process, at least not in the way medical experts usually think of the term. As Rao et al. point out, “ChatGPT’s answers are generated based on finding the next most likely 'token'—a word or phrase to complete the ongoing answer.” That doesn’t sound like reasoning so much as a search for the statistically most probable continuation.
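
To make that distinction concrete, here is a toy sketch of next-token selection. The candidate tokens and their probabilities are invented, and a real model scores tens of thousands of candidates with a neural network rather than a lookup table, but the decision rule, picking (or sampling from) the most probable continuations, is the heart of the mechanism.

```python
# Toy illustration of next-token prediction. The candidate tokens and their
# probabilities are invented; a real LLM computes a distribution over tens of
# thousands of tokens with a neural network, but the selection step is similar.

prompt = "The patient most likely experienced an acute"
next_token_probs = {
    " myocardial": 0.46,   # statistically most likely continuation
    " aortic":     0.21,
    " pulmonary":  0.18,
    " anxiety":    0.15,
}

chosen = max(next_token_probs, key=next_token_probs.get)  # greedy decoding
print(prompt + chosen)  # -> "The patient most likely experienced an acute myocardial"
```

Repeat that step over and over and you get fluent, plausible text; nothing in the loop requires the causal model of disease that a clinician brings to the same sentence.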

That is not to suggest that the technology underlying LLMs can’t perform some impressive feats, especially with the help of prompt engineering. If the prompts given to an LLM are structured properly, or if each prompt and answer is followed by a more probing query, the results can indeed be impressive. Peter Lee of Microsoft, a principal investor in OpenAI, the company behind ChatGPT, provided the chatbot with a detailed description of a 43-year-old woman who presented to the ED with abdominal pain, nausea, vomiting, right quadrant pain, an elevated WBC count, and other findings, and asked, “What is your initial impression?” The bot provided a detailed response suggesting the physician consider not just appendicitis but also ovarian torsion and ectopic pregnancy, pointing out that imaging might pin down a definitive diagnosis. Lee replied by asking whether a CT scan might be inappropriate given the possibility of pregnancy, which would expose the fetus to ionizing radiation. The chatbot suggested ultrasound as a possible alternative. The two-way conversation continued, with both participants offering useful insights on the best course of action.
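
Structurally, a back-and-forth like Lee’s is usually just a growing list of messages that gets resubmitted to the model on every turn. The sketch below paraphrases the exchange to show that structure; it is not a transcript, and the wording is our own.

```python
# Rough sketch of a multi-turn exchange. The full message history is resubmitted
# on each turn, which is what lets the model revise its earlier suggestion when
# the clinician raises the radiation concern. Wording is paraphrased, not quoted.

conversation = [
    {"role": "user",
     "content": "43-year-old woman in the ED with abdominal pain, nausea, "
                "vomiting, right quadrant pain, and an elevated WBC count. "
                "What is your initial impression?"},
    {"role": "assistant",
     "content": "Consider appendicitis, but also ovarian torsion and ectopic "
                "pregnancy; imaging could help narrow the differential."},
    {"role": "user",
     "content": "Would a CT scan be inappropriate if she might be pregnant, "
                "given the ionizing radiation exposure to a fetus?"},
    # The next model reply can weigh that new constraint (e.g., suggesting
    # ultrasound) only because it sees the entire history above.
]
```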

While conversations like this have attracted the attention of many clinicians and technology developers, the question remains: Is this the kind of deep clinical reasoning that experienced healthcare experts use to make decisions? An expert clinician conducts an individualized medical history and physical exam, often adjusting each component of the workup based on a patient’s subtle physical and emotional cues, something no chatbot can do. Expert clinicians are also skilled at choosing and interpreting lab tests and imaging results, which don’t always conform to the recommendations outlined in the textbook cases ingested by LLMs. That interpretation involves weighing sensitivity, specificity, area under the curve, and positive predictive value. The best clinicians are also self-aware enough to watch for a long list of cognitive biases that can influence the decision-making process. Finally, they take into account published clinical guidelines, scoring systems, and various other decision aids. ChatGPT, on the other hand, usually doesn’t provide references to the biomedical literature or clinical guidelines to support its recommendations.
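
For readers who want the arithmetic behind those test-performance terms, here is a minimal sketch that applies the standard definitions to a 2×2 confusion matrix; the counts are made up purely for illustration. Area under the curve is omitted because it summarizes sensitivity and specificity across every possible test threshold rather than from a single 2×2 table.

```python
# Standard test-performance metrics from a 2x2 confusion matrix.
# The counts passed in below are made up purely to show the arithmetic.

def test_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

print(test_metrics(tp=90, fp=30, tn=870, fn=10))
# {'sensitivity': 0.9, 'specificity': 0.966..., 'ppv': 0.75, 'npv': 0.988...}
```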

The debate on what’s “under the hood” of LLMs will likely continue for some time. Some thought leaders point to evidence showing these tools have logical, analytical skills that are very similar to our own. Others believe LLMs are simply good at calculating the statistics of language and can “pattern match output well enough to mimic understanding.” Until this puzzle is solved, it’s best to use LLMs to supplement our innate reasoning ability and clinical experience, always keeping in mind that they sometimes invent things that sound very plausible.

