The next installment in our primer series uses plain English to explain what many thought leaders are calling a paradigm shift in medical AI.

By Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform, Professor at Northeastern University, and John Halamka, M.D., Diercks President, Mayo Clinic Platform
If you were to look up a definition of foundation models (FM), you might find something like this: “Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models.” Unfortunately, without a background in computer science, this explanation would probably not be very helpful because it relies on terms that are equally unfamiliar. Among the cryptic terms: models, deep learning, and machine learning (ML). To understand the role of foundation models, it helps to first understand these more basic concepts.
In the AI world, the terms model and algorithm are often used synonymously to refer to a set of instructions or a decision tree. We explained ML in detail in an earlier primer, but at the most basic level, it refers to the ability of specialized computer programs to learn certain tasks without being explicitly programmed by humans. This is in contrast to software programs written line by line by human coders. The IBM supercomputer that defeated the world chess champion in 1997 was programmed by IBM's coding team, who entered specific instructions on how to respond to the various chess moves made by its opponent. Google’s AlphaZero, on the other hand, defeated chess champions through trial and error, using advanced statistical and probabilistic reasoning to detect relationships and patterns in its opponents’ play without human programmers writing the rules, an example of machine learning.
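To make that contrast concrete, here is a minimal sketch in Python using the scikit-learn library. The glucose threshold, the toy readings, and the function names are illustrative assumptions for this primer, not clinical guidance or code from any system mentioned here.

```python
# A minimal sketch contrasting explicit programming with machine learning,
# using the scikit-learn library. All numbers are made-up illustrative values.

from sklearn.tree import DecisionTreeClassifier

# Explicit programming: a human writes the rule by hand.
def rule_based_flag(fasting_glucose_mg_dl):
    """Flag a reading using a hard-coded threshold (illustrative only)."""
    return fasting_glucose_mg_dl >= 126

# Machine learning: the model infers a similar rule from labeled examples.
glucose_readings = [[90], [105], [118], [130], [145], [160]]
labels = [0, 0, 0, 1, 1, 1]  # 0 = not flagged, 1 = flagged

model = DecisionTreeClassifier().fit(glucose_readings, labels)
print(model.predict([[128]]))  # the learned rule flags this reading
```

The point is not the specific library but the shift in who supplies the rule: in the first case a human encodes it, in the second the program derives it from examples.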
Deep learning is a type of ML. It uses a network of artificial neurons to learn patterns in the data it analyzes, whether that data consists of text, images, or audio, and those patterns can then be used to help clinicians make better decisions. An artificial neural network that helps screen patients for diabetic retinopathy is one example of deep learning.
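The sketch below, written with the PyTorch library, shows what a very small artificial neural network looks like in code. The class name, the layer sizes, and the 64-by-64 "image" are all made up for illustration; a real diabetic retinopathy screening model would be far larger and rigorously validated.

```python
# A minimal sketch of a deep learning model: a tiny artificial neural network
# built with PyTorch. Purely illustrative, not a real screening system.

import torch
import torch.nn as nn

class TinyRetinaNet(nn.Module):           # hypothetical name for illustration
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),                 # turn a 64x64 image into a vector
            nn.Linear(64 * 64, 128),      # first layer of artificial "neurons"
            nn.ReLU(),
            nn.Linear(128, 1),            # single output score
            nn.Sigmoid(),                 # squash the score to a 0-1 range
        )

    def forward(self, x):
        return self.layers(x)

model = TinyRetinaNet()
fake_image = torch.rand(1, 64, 64)        # stand-in for a retinal photograph
print(model(fake_image))                  # untrained, so the score is meaningless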
While programs like this can accomplish amazing feats, they are designed to focus on a narrow domain and are therefore trained on a data set for that domain, whether it be eye disease, skin cancer, or any number of other areas. Foundation models, on the other hand, paint with much broader strokes. Scott et al. explain: “FM represent a paradigm shift in that, rather than having to develop a bespoke model for each specific use case (or task), a single FM can instead be reused or repurposed across a broad range of tasks with minimal adaptation or retraining needed for each task.” In the physical world, you can think of an FM as a Swiss Army knife, which may include a blade, a screwdriver, a can opener, and other tools, in contrast to a pocket knife that contains only one or two blades.
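As a rough illustration of that reuse, the sketch below points a single small, publicly available pretrained model (google/flan-t5-small, via the Hugging Face transformers library) at two different tasks simply by changing the instruction in the prompt. The model choice and the sample note are assumptions made for this primer, not how any of the clinical systems discussed here are built.

```python
# A minimal sketch of the "one model, many tasks" idea behind foundation
# models, assuming the Hugging Face transformers library is installed.

from transformers import pipeline

# One pretrained model, reused for different tasks with no retraining,
# only a different instruction in the prompt.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

note = "Patient reports blurred vision and a fasting glucose of 160 mg/dL."

# Task 1: summarize the note.
print(generator("Summarize: " + note))

# Task 2: answer a simple question about the same note.
print(generator("Does this note mention an eye symptom? Answer yes or no: " + note))
```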
Large language models (LLMs) are a specific type of FM. They derive their strength from the fact that they draw on a massive collection of data; in the case of ChatGPT, that means over a trillion data points gathered from across the Internet. Unfortunately, because this data set mixes accurate and inaccurate information, the results can be erratic, sometimes brilliant and sometimes pure hallucination. Several developers have addressed this problem by creating chatbots that draw on more scientifically reliable data sets, including OpenEvidence, Consensus, PubMedGPT, BioGPT, and Med-PaLM 2.
Because these digital tools are more reliable, many clinicians have been using them to supplement patient care. They are being used to answer patients’ email questions, compose discharge summaries, generate operative notes, collect evidence for scientific papers, and much more. But clinicians still need to be cautious, especially if they want to use them for clinical decision support. Those who rely too heavily on any of these LLMs to diagnose and treat patients fall into a trap: they mistake a chatbot’s answers for a form of reasoning. They aren’t, at least not in the human sense of the word.
LLMs use statistical next-word prediction, not deep causal reasoning. They can predict the most likely next word in a sentence by detecting word patterns in their data set, but that is not reasoning. Put another way, they are optimized for “linguistic plausibility,” not for facts that have been confirmed in the peer-reviewed medical literature or in electronic health records; they have no ground truth to fall back on. Experienced clinicians, on the other hand, use experience-based heuristics and deeper insights, drawing on the work they have done caring for patients with similar lab results, signs, and symptoms. They also take into account not only medical expertise but the values and preferences of the patient in front of them.
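To see what next-word prediction looks like under the hood, here is a minimal sketch that asks the small, general-purpose GPT-2 model (via the Hugging Face transformers library) for the five most probable words that could follow a short phrase. GPT-2 and the prompt are stand-ins chosen for illustration; the point is simply that the model ranks candidate words by probability rather than reasoning about the patient.

```python
# A minimal sketch of statistical next-word prediction, using the small
# GPT-2 model purely as an illustration (assumes torch and transformers).

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The patient was given a prescription for"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits     # a score for every word in the vocabulary

# Convert the scores at the final position into probabilities and list the
# five most likely next tokens.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12}  p = {p.item():.3f}")
```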
Despite all their limitations, foundation models are clearly a step forward in healthcare AI. We eagerly anticipate new developments in this growing field.
