Large Language Models Inch Their Way Toward Competence

The latest research suggests that these digital tools will eventually become useful medical assistants, but we still need to watch for unpredictable fabrications and misstatements.

By Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform and John Halamka, M.D., President, Mayo Clinic Platform.

Generative AI evangelists believe that large language models (LLMs) are having a profound impact on healthcare, moving us into a less stressful, more productive ecosystem that benefits patients and clinicians alike. A dispassionate analysis of the most recent research offers a more nuanced view, suggesting that LLMs are inching their way toward competence on both the administrative and clinical sides of medicine.

In a recent blog, we reviewed the evidence supporting the role of ChatGPT as a clinical decision support tool. Several studies have demonstrated that the chatbot does quite well at answering questions on the medical licensing exam. But as most clinicians know, dealing with the static situations described in an exam is much easier than making a diagnosis in the real world of clinical medicine. With that in mind, others have taken up the challenge. For example, Yat-Fung Shea et al. evaluated ChatGPT’s ability to serve as a clinical decision support tool. They reviewed the medical histories of six geriatric patients whose definitive diagnosis had been delayed by more than a month. Each presentation was given to ChatGPT-4 and a commercially available CDSS (Isabel Healthcare) for analysis; the chatbot was not told the clinicians’ final diagnosis. Among the six patients, GPT-4 accurately diagnosed four (66.7%), clinicians two (33.3%), and Isabel none. The investigators also found that certain key words, including abdominal aortic aneurysm, proximal stiffness, acid-fast bacilli in urine, and metronidazole, helped determine how accurately the chatbot made the diagnosis.

A recent experiment conducted by Google DeepMind provides further evidence that LLMs are maturing. Daniel McDuff and his colleagues gave about 20 physicians who were not board-certified specialists a series of challenging case scenarios previously published by the New England Journal of Medicine as part of the clinicopathological conference series from the Massachusetts General Hospital. McDuff et al asked their LLM to generate a differential diagnosis list on its own, and also asked the clinicians to arrive at their own differential diagnoses with the help of the LLM. The researchers explain: “Our study introduces an LLM for DDx, a model which uses a transformer architecture (PaLM 2), fine-tuned on medical domain data; alongside an interface for enabling its use as an interactive assistant for clinicians.”

During the experiment, each case report was read by two physicians, who were randomly assigned either to arrive at a differential diagnosis using traditional means (a search engine and the usual medical resources) or to use the LLM alongside those traditional tools. Before being placed in one of the two groups, each physician generated an unassisted differential diagnosis (DDx). The standalone LLM outperformed the unassisted physicians on top-10 accuracy (59.1% vs 33.6%, p = 0.04). When the investigators compared the group assisted by the LLM to the group using the more traditional approach of search engines and books, the LLM once again came out ahead, with a top-20 accuracy of 51.7% vs 36.1%. In addition, those assisted by the LLM delivered a more comprehensive list of potential diagnoses than those who were not given access to the tool.

While their results demonstrate that a properly trained LLM can do far more than answer questions on a medical licensing exam, the experiment had some important weaknesses. McDuff et al acknowledge:

  1. Performance in a clinicopathological conference “in no way reflects a broader measure of competence in a physician's duties.”
  2. “DDx comprises many other steps that were not scrutinized in this study, including the goal-directed acquisition of information under uncertainty.”
  3. “…the clinical pathology case presentation format and input into the model do differ in important ways from how a clinician would evaluate a patient and generate their DDx at the outset of a clinical encounter.”

LLMs also have the potential to redefine how clinicians interact with EHRs, and how they gather a more complete patient history, including any social determinants of health (SDOH). By one estimate, between 80 and 90% of modifiable risk factors that affect patient outcomes involve SDOH. Unfortunately, these social influences are all too often overlooked or never entered into EHRs. Harvard Medical School researchers developed an AI model to extract these modifiers from patients’ records, pulling out six social risk factors: employment, housing, transportation, parental status, relationship, and social support. Their model, a fine-tuned version of Flan-T5 XL, detected 93.8% of patients with adverse SDOH, compared with ICD-10 codes, which captured only 2%.
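The task shape is straightforward: free-text clinical note in, set of flagged social risk factors out. The Harvard team used a fine-tuned LLM for this; the toy keyword-matching stand-in below only illustrates the input/output structure and the six factor categories (the keyword lists and sample note are invented for illustration):

```python
# Toy stand-in for SDOH extraction from a clinical note. This is NOT the
# Harvard team's method (they fine-tuned Flan-T5 XL); it only sketches the
# task: unstructured note text in, flagged social risk factors out.
# All keyword phrases below are invented for illustration.

SDOH_KEYWORDS = {
    "employment": ["unemployed", "laid off", "lost his job", "lost her job"],
    "housing": ["homeless", "eviction", "unstable housing"],
    "transportation": ["no transportation", "cannot get to appointments"],
    "parental status": ["single parent", "sole caregiver for children"],
    "relationship": ["recently divorced", "widowed"],
    "social support": ["lives alone", "no family nearby", "socially isolated"],
}

def extract_sdoh(note: str) -> set[str]:
    """Return the set of social risk factors whose keywords appear in the note."""
    note_lower = note.lower()
    return {
        factor
        for factor, phrases in SDOH_KEYWORDS.items()
        if any(phrase in note_lower for phrase in phrases)
    }

note = "Patient was recently laid off, lives alone, and reports unstable housing."
print(sorted(extract_sdoh(note)))  # ['employment', 'housing', 'social support']
```

An LLM replaces the brittle keyword table with language understanding, which is why it can catch the many paraphrases ("between jobs," "couch surfing") that fixed code lists like ICD-10 miss.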

The best clinicians know that being an expert diagnostician requires mastering the art of noticing. In many ways, it’s not much different from the skills one needs to develop when piecing together a 1,000-piece jigsaw puzzle. A picture of a bouquet, for instance, may include flowers of various colors, but it’s not enough to separate pieces by color. One must notice subtle differences between each flower—the folds in their petals, the stems that intertwine with other parts of the picture, the pieces that share space with the adjacent wall, and so on. Once you’ve finished the puzzle, you’ve learned subtle differences between lilies, irises, peonies, and azaleas that you never paid attention to before. It’s not that different from noticing subtle changes in a patient’s speech, minor variations in their heartbeat, slight changes in skin tone, or dozens of other signposts that enable you to clinch a difficult medical diagnosis. While we’re optimistic that LLMs will eventually become competent medical assistants, it’s unlikely they will master the skills of an expert puzzle solver, or have the subtle perceptive skills of an experienced human diagnostician.
