Comparing Large Language Models in Healthcare
Given large language models’ tendency to invent “facts,” several researchers have begun comparing their strengths and weaknesses.

By John Halamka, M.D., President, Mayo Clinic Platform and Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform
“Models mostly know what they know, but they sometimes don’t know what they don’t know.” Put another way, AI-powered chatbots are usually accurate when asked questions for which there is clear-cut, definitive data. If, for example, you ask a chatbot what metformin is indicated for, it will most likely state that it’s used to treat Type 2 diabetes because that fact is readily available. But choose a topic on which there is some controversy, and there is a much greater risk of getting a misleading or fabricated response. Not only will the chatbot fail to admit that it doesn’t know, but it will sometimes invent an answer with great confidence. With these concerns in mind, healthcare professionals need some way to gauge the accuracy and trustworthiness of the LLMs they use. Several studies have compared these digital tools to help you make a more informed decision.
An analysis summarized in Nature compared hallucination rates across several LLMs and found that the worst offenders included Technology Innovation Institute Falcon 7B-instruct, Google Gemma 1.1-2B-it, and Qwen 2.5.5B-instruct, with hallucination rates of 29.9%, 27.8%, and 25.2%, respectively. The four best performers were OpenAI ChatGPT-4, OpenAI ChatGPT o1-mini, Zhipu AI GLM-4-9B-Chat, and Google Gemini 2.0 Flash Experimental (1.8%, 1.4%, 1.3%, and 1.3%). A second analysis reviewed the ability of LLMs to improve systematic reviews written by humans; its authors compared GPT-3.5, GPT-4, and Bard and found hallucination rates of 39.6%, 28.6%, and 91.4%, respectively.
Several investigators have also compared LLMs to determine their capacity to serve as diagnostic assistants. Japanese researchers evaluated the ability of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro to solve the kinds of diagnostic puzzles radiologists face in routine practice, starting with over 300 quiz questions that included clinical histories and imaging findings. Diagnostic accuracy was 41%, 54%, and 33.9%, respectively.
Others have compared several LLMs to determine whether they might help clinicians perform a differential diagnosis of rare diseases, an especially difficult challenge for physicians with little exposure to these unusual conditions. Using a BERT-based natural language processing model, the investigators evaluated rare disease data sets from several countries that included over 2,000 cases and 431 diseases. Their tool, called PhenoBrain, “achieved an average predicted top-3 recall of 0.513 and a top-10 recall of 0.654, surpassing 13 leading prediction methods, including GPT-4.”
Similarly, Chinese researchers developed a generalist medical language model to assist with disease diagnosis, called MedFound, and compared its performance with that of several other LLMs, including ChatGPT-4o, MEDITRON, Clinical Camel, and Llama 3-70B. MedFound was trained on a large data set of medical text and real-world clinical records and used to generate the best differential diagnoses. The accuracy of MedFound’s top-3 diagnoses was 84.2%, compared with 64.8% for Clinical Camel-70B, 62% for ChatGPT-4o, and 56.8% for MEDITRON-70B.
One of the major weaknesses of most LLMs is the data sets they draw from: typically the entire Internet, with all its useful facts and fallacies. Some companies have recently developed models that draw on far more reliable biomedical data sets; Consensus and OpenEvidence* come to mind. According to Consensus, its platform uses a collection of academic papers, research studies, and systematic reviews. Similarly, OpenEvidence uses data from a variety of well-respected sources, including peer-reviewed medical journal articles.
Fortunately, many of the shortcomings of LLMs can be addressed by prompt engineering, which in its simplest form involves asking second and third questions in response to the chatbot’s first answer. If you ask ChatGPT to provide a list of studies on the value of a specific diagnostic test, for instance, one follow-up prompt should be: “Are all of these references from real, peer-reviewed medical journals?” If they aren’t, the LLM will often admit as much and point the reader to legitimate journals. Even then, it’s best to follow up on the citations by visiting the journal websites to confirm their trustworthiness.
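For readers who query chatbots programmatically rather than through a web interface, the same two-turn “audit your own citations” pattern can be scripted. The sketch below is illustrative only: it assumes the OpenAI Python SDK (openai 1.x), an API key in the OPENAI_API_KEY environment variable, and a placeholder model name and example question; the identical follow-up can just as easily be typed by hand into any chat window.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Turn 1: the original request for citations (the example question is hypothetical).
history = [{"role": "user",
            "content": "List peer-reviewed studies on the diagnostic value of "
                       "high-sensitivity cardiac troponin."}]
first = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Turn 2: the follow-up prompt that asks the model to audit its own answer.
history.append({"role": "user",
                "content": "Are all of these references from real, peer-reviewed "
                           "medical journals? Flag any you cannot verify."})
second = client.chat.completions.create(model="gpt-4o", messages=history)
print(second.choices[0].message.content)

Whatever the model replies, the final step above still applies: confirm the citations on the journals’ own websites.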
Obviously, we have a long way to go before we can rely on LLMs the way we rely on PubMed or other trusted clinical decision support systems. But at the same time, many clinicians have found that these digital tools have value when used judiciously.
*Footnote: OpenEvidence is a Mayo Clinic Platform_Accelerate medical search company in which Mayo Clinic has a financial interest. Mayo Clinic will use any revenue it receives to support its not-for-profit mission in patient care, education, and research.