Can Large Language Models Function as Scientific Reasoning Engines?
The capabilities of generative AI continue to grow. Using them wisely will likely improve clinical decision making.

By Paul Cerrato, MA, senior research analyst and communications specialist, and John Halamka, M.D., Diercks President, Mayo Clinic Platform
Several thought leaders believe LLMs have the potential to help improve medical diagnosis and make accurate predictions about a patient’s future. Countless others now use these digital tools as research assistants. There is also evidence suggesting they may be capable of taking on an even larger role, serving as scientific reasoning engines. For example, one research team believes: “…[an] LLM can be provided with detailed text containing specific information, for example, as part of the prompt, which the LLM then interprets, compares and uses as context to produce its output. Providing such contextual information to the input prompt enables what is called ‘in-context learning’.”
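To make “in-context learning” a bit more concrete, here is a minimal sketch, not drawn from the research team quoted above, of how contextual information can be packaged into a prompt. The clinical details are invented for illustration, and query_llm is a hypothetical placeholder for whatever chat-completion API a team actually uses.

```python
# Minimal sketch of in-context learning: the relevant background is placed
# directly in the prompt, and the model is asked to reason over it.
# query_llm() is a hypothetical placeholder, not a real API call.

CONTEXT = (
    "62-year-old patient, three days of fever and productive cough, "
    "oxygen saturation 91% on room air, focal crackles in the right lower lobe. "
    "(Details invented for illustration.)"
)

QUESTION = "What are the three most likely diagnoses, and what evidence supports each?"


def build_prompt(context: str, question: str) -> str:
    """Combine the supplied context and the clinical question into one prompt."""
    return (
        "Answer using only the context provided below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM service and return its reply."""
    raise NotImplementedError("Connect this to a chat-completion API of your choice.")


if __name__ == "__main__":
    print(build_prompt(CONTEXT, QUESTION))
```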
This approach suggests that LLMs have the ability to perform symbolic logic, generate analogies, extrapolate key ideas from a scientific source, and evaluate evidence. Can they actually perform these complex cognitive tasks? And if so, how useful would they be in helping clinicians make difficult diagnostic and treatment decisions when there are no definitive recommendations in the medical community?
To address those questions, it helps to first define reasoning. The dictionary says it’s the capacity for logical, rational, and analytic thought and includes synonyms like induction, deduction, and syllogization. Clinical reasoning can be defined as “the set of reasoning strategies that permit us to combine and synthesize diverse data into one or more diagnostic hypotheses, make the complex trade-offs between benefits and risks of tests and treatments, and formulate plans for patient management.” Experienced nurses and physicians know this process very well and use it every day to manage their patients. It usually takes advantage of Type 1 and Type 2 thinking to reach a diagnosis or find an effective treatment option.
Type 1 thinking is used by most experienced clinicians because it’s an essential part of the pattern recognition process. This intuitive mode employs heuristics and inductive shortcuts to help them arrive at quick conclusions about what’s causing a patient’s collection of signs and symptoms. It serves them very well when the pattern is consistent with a common disease entity. Recognizing the typical signs and symptoms of an acute myocardial infarction, for example, allows clinicians to quickly take action to address the underlying pathology.
Type 2 reasoning, on the other hand, is particularly effective in scenarios in which the patient’s presentation follows no obvious disease script, when patients present with an atypical pattern, and when there is no unique pathognomonic signpost to clinch the diagnosis. It usually starts with a hypothesis that is then subjected to analysis with the help of critical thinking, logic, multiple branching, and evidence-based decision trees and rules. This analytic approach also requires an introspective mindset that is sometimes referred to as metacognition, namely, the “ability to step back and reflect on what is going on in a clinical situation.”
LLMs may appear to have these capabilities, but looking “under the hood” indicates otherwise. They don’t reason in the same way as experts who have a deep knowledge of a medical specialty. Instead, they draw conclusions from probabilities derived from text data, stringing together words, phrases, and sentences that appear logical and coherent. And when they are unable to definitively answer a question posed by a user, they sometimes generate confident-sounding but inaccurate statements because they are trained to fill in the most likely words in a sentence rather than rely on solid evidence. Essentially, they seek out patterns in their data sets and create content that conforms to these patterns. In the words of ChatGPT o1: “They are maximizing the likelihood of coherent-sounding output—rather than systematically evaluating each claim’s truth.”
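A toy example may help show what “maximizing the likelihood of coherent-sounding output” means in practice. The probability table below is entirely made up; the point is only that a model of this kind selects the statistically most likely continuation, not the verified one.

```python
# Toy illustration of next-token selection (not a real language model).
# The probabilities are invented; the model-like behavior is simply to pick
# the continuation that is most likely, whether or not it is actually true.

CONTINUATION_PROBS = {
    "the most likely cause of the patient's chest pain is": {
        "acute coronary syndrome": 0.41,
        "gastroesophageal reflux": 0.33,
        "a musculoskeletal strain": 0.26,
    }
}


def most_likely_continuation(prompt: str) -> str:
    """Return the highest-probability continuation for a prompt, ignoring truth."""
    candidates = CONTINUATION_PROBS.get(prompt, {})
    if not candidates:
        return "<no learned continuation>"
    return max(candidates, key=candidates.get)


print(most_likely_continuation("the most likely cause of the patient's chest pain is"))
```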
In contrast, clinical decision support systems are built on explicit, validated rules and transparent chains of inference that have been tested by clinicians and grounded in their experience with real patients and their analysis of peer-reviewed medical studies. That is not to suggest that every diagnostic or therapeutic statement from an expert system is correct. Misdiagnosis remains a persistent problem in healthcare, which is why LLMs can play a useful role as an assistant.
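For contrast, here is a deliberately tiny sketch of the rule-based style just described: each rule is explicit and can be read, audited, and traced back to the evidence behind it. The rules and thresholds below are simplified placeholders for illustration, not validated clinical criteria.

```python
# Deliberately simplified sketch of a rule-based clinical decision support step.
# Each rule is explicit and inspectable; the thresholds below are placeholders,
# not validated clinical criteria.

def flag_possible_sepsis(temp_c: float, heart_rate: int, resp_rate: int) -> tuple[bool, list[str]]:
    """Apply transparent rules and return the decision plus the rules that fired."""
    fired = []
    if temp_c > 38.0 or temp_c < 36.0:
        fired.append("abnormal temperature")
    if heart_rate > 90:
        fired.append("tachycardia")
    if resp_rate > 20:
        fired.append("tachypnea")
    return len(fired) >= 2, fired


alert, reasons = flag_possible_sepsis(temp_c=38.6, heart_rate=104, resp_rate=24)
print(alert, reasons)
```

Unlike a probability-driven text generator, every output here can be traced to a named rule, which is what makes this kind of system testable and explainable.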
Although the evidence doesn’t support the belief that LLMs can serve as autonomous reasoning engines, case reports suggest that at times they can match or even exceed the average clinician’s ability to make accurate diagnoses. Consider one example.
Isaac Kohane, M.D., Ph.D., a pediatric endocrinologist and informatics expert, was called in to consult on a newborn with hypospadias, a congenital condition affecting the genitals. When he typed in the boy’s signs and symptoms and queried GPT-4 for possible differential diagnoses, the LLM offered several options, including congenital adrenal hyperplasia (CAH), androgen insensitivity syndrome, and other rather esoteric disorders. Then, Dr. Kohane refined his prompt by adding the child’s hormone levels and the fact that an abdominal ultrasound revealed the baby had a uterus. The chatbot concluded that CAH was the most likely diagnosis and provided a detailed explanation of the likely pathophysiology. (A complete write-up of the case is available in The AI Revolution in Medicine: GPT-4 and Beyond.) When Kohane did additional genetic testing, he confirmed CAH, pointing out that 99% of practicing physicians probably would not have made the diagnosis.
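The back-and-forth in that case, a broad query first and then new findings appended as they arrive, can be sketched as a simple refinement step. The findings below paraphrase the case narrative rather than reproduce it, and query_llm is again a hypothetical placeholder rather than any specific product’s API.

```python
# Sketch of iterative prompt refinement: start broad, then append new findings
# as they become available and ask the model to update its differential.
# query_llm() is a hypothetical placeholder for a real chat-completion call.

def refine_prompt(base_prompt: str, new_findings: list[str]) -> str:
    """Append newly available findings to the original question."""
    findings_text = "\n".join(f"- {finding}" for finding in new_findings)
    return (
        f"{base_prompt}\n\n"
        f"Additional findings:\n{findings_text}\n\n"
        "Revise the differential diagnosis accordingly."
    )


BASE = "Newborn with hypospadias. What differential diagnoses should be considered?"
FINDINGS = [
    "hormone panel results (values omitted here)",
    "abdominal ultrasound shows a uterus",
]

print(refine_prompt(BASE, FINDINGS))
```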
Of course, this example conceals the fact that ChatGPT often gets it wrong, which is why developers, including Mayo Clinic, are working on creating more trustworthy assistants. Until that day arrives, it makes sense to use available chatbots cautiously, refining prompts with carefully thought-out questions.