When Physicians Go Head-to-Head with Large Language Models

By John Halamka • May 10, 2023

What happens when ChatGPT-4 and a human cardiologist are asked to diagnose the same patient? The results are quite revealing.

By Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., president, Mayo Clinic Platform

For decades, innovators have been looking for ways to use computer technology to improve patient care and ease clinicians' workloads. Some have even suggested that AI-based algorithms are as effective as physicians in diagnosing specific diseases. In 2016, Varun Gulshan at Google and his associates from several medical schools tested that theory, using a convolutional neural network (CNN) to analyze retinal images. Looking for evidence of diabetic retinopathy, they found the CNN's diagnostic skills were on par with those of several board-certified ophthalmologists. Several additional studies have followed to determine if machine learning tools could augment or replace clinicians.

For example, British and German investigators analyzed 82 studies that compared deep learning algorithms to clinicians to determine their respective abilities to detect diseases in medical images. They "found the diagnostic performance of deep learning models to be equivalent to that of health care professionals. However, a major finding of the review is that few studies presented externally validated results or compared the performance of deep learning models and health care professionals using the same sample." A second analysis from British and American researchers also compared AI to clinicians, concluding: "Few prospective deep learning studies and randomized trials exist in medical imaging. Most nonrandomized trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small."

Now that ChatGPT and other large language models have surfaced, many thought leaders are also wondering how they might perform when challenged with diagnostic puzzles. With that in mind, we sent a prompt to ChatGPT-4 that read as follows:

"Mr. Jones, 59 years old, with a history of hypertension, stroke, and elevated lipid levels, arrives in the ED complaining of sudden-onset intense substernal chest pain that radiates to his left leg but does not affect his left arm or jaw. He also has an elevated troponin I level. What is the correct diagnosis?"

ChatGPT replied: "Based on the symptoms and medical history provided, Mr. Jones is presenting with a potential myocardial infarction (MI), also known as a heart attack. The sudden onset of intense substernal chest pain that radiates to his left leg is a common symptom of a heart attack. The fact that the pain does not affect his left arm or jaw is not unusual, as some heart attacks can present with atypical symptoms.

"The elevated troponin I level is also a significant finding, as it indicates damage to the heart muscle. Troponin is a protein released into the bloodstream when heart muscle cells are damaged or die, which is a hallmark of a heart attack. It is important to note that a definitive diagnosis can only be made by a healthcare professional after conducting a thorough medical evaluation, including a physical examination, ECG, and other tests as necessary. Prompt medical attention is essential in suspected cases of heart attack, as early intervention can improve outcomes and reduce the risk of complications."

Several years ago, a case report published in the Journal of General Internal Medicine described an identical patient scenario in which J. William Schleifer, M.D., and his colleagues at the University of Alabama at Birmingham explained their diagnostic reasoning and the conclusion they reached. We provided a shortened version of that case in one of our recent books, Reinventing Clinical Decision Support: Data Analysis, Artificial Intelligence, and Diagnostic Reasoning. Based on a methodical review of all the patient data, Dr. Schleifer et al. questioned the significance of the patient's radiating left leg pain. Expert diagnosticians are distinguished by their more fully developed disease scripts and by their ability to spot findings that don't fit those scripts. The leg pain was one of those clues that might warrant a walk down a different diagnostic path.

They also used a reasoning technique sometimes referred to as pre-mortem examination. Essentially, they asked themselves: What would happen once a specific diagnosis is made and acted upon? What are the consequences, good and bad? In the case of Mr. Jones, if he is treated with the anticoagulants usually indicated for a typical MI, and he actually had another condition such as an aortic dissection, the consequences could prove disastrous. The pre-mortem analysis and the fact that the patient had radiating left leg pain were enough to postpone treating the alleged MI until additional data was collected. Once the patient was admitted to the medical floor, the appearance of a systolic murmur plus chest pain strongly suggested aortic dissection, a tear in this major blood vessel; the tear was finally confirmed with a CT angiogram. The imaging study also documented that the dissection extended all the way down Mr. Jones' thoracic descending aorta, which explained the mysterious leg pain.

Their correct diagnosis raises the question: Why didn't ChatGPT reach the same conclusion? The scenario dramatically illustrates the difference between LLMs trained on the general content of the internet and the "database" residing within the brain of a veteran cardiologist with decades of clinical experience and expert reasoning skills. It also highlights that LLMs don't actually reason, at least not in the way humans do, and that they lack critical thinking skills that computers have yet to master. As Chirag Shah with the University of Washington explains, "Language models are not knowledgeable beyond their ability to capture patterns of strings or words and spit them out in a probabilistic manner… It gives the false impression of intelligence."

Admittedly, the chatbot did warn users that a definitive diagnosis was not possible and should only be made by a health care professional "after conducting a thorough medical evaluation, including a physical examination, ECG, and other tests as necessary." But realistically, many patients, and perhaps a few clinicians, would rely heavily on its conclusion, with life-threatening consequences.

That is not to suggest that LLMs have no place in medicine. One approach that has merit is retrieval-augmented generation (RAG), which can control model knowledge and reduce hallucinations. The idea is simple. Clinical information from Mayo Clinic, Duke, Intermountain, and Massachusetts General Brigham is trustworthy. An LLM based on content from these sources would be far less likely to mislead users with misinformation or hallucinations, provided it has appropriate guardrails inserted by its developers.
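
For readers who want to see the shape of the idea, here is a minimal Python sketch of the retrieval step in RAG. It is an illustration under stated assumptions, not a description of any system in use: the three passages are invented placeholders standing in for a vetted clinical corpus, the word-overlap scoring is a deliberately crude substitute for the embedding search a real system would likely use, and the function names (`retrieve`, `build_prompt`) are hypothetical. No model is called; the sketch only shows how trusted text is selected and placed in front of the question.

```python
import re
from collections import Counter

# Hypothetical placeholder corpus. These passages are illustrative only,
# not clinical guidance and not content from any named institution.
TRUSTED_PASSAGES = [
    "Aortic dissection can cause chest pain that radiates to the back or legs.",
    "Myocardial infarction often causes chest pain radiating to the left arm or jaw.",
    "Diabetic retinopathy is detected by examining retinal images.",
]


def tokenize(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=2):
    """Return the k passages sharing the most words with the question."""
    q = tokenize(question)
    # Counter & Counter keeps the minimum count of each shared word.
    ranked = sorted(
        passages,
        key=lambda p: sum((q & tokenize(p)).values()),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that restricts the model to the retrieved sources."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the sources below. "
        "If they are insufficient, say so.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("chest pain that radiates to the legs", TRUSTED_PASSAGES))
```

The instruction to answer only from the listed sources is one simple form of guardrail; whether a given model actually honors it is something developers would still need to test.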

Like most new technologies, LLMs are neither a godsend nor a pending apocalypse. With the right blend of business skills, ethical principles, and the sincere desire to put patients' needs first, they'll eventually become a valuable part of medicine's digital toolbox.
