Large Language Models: Establishing a Better Roadmap
Understanding their limitations, defining the best use cases, and closing the gaps in transparency and trustworthiness are the keys to responsible LLM adoption.
By Sonya Makhni, M.D., M.B.A., M.S., Medical Director and Clinical Informaticist, Mayo Clinic Platform
The debate over large language models (LLMs) continues. While many stakeholders view models such as GPT-4 as potentially transformative, others fear negative repercussions down the road. While LLMs have the potential to solve many health care problems by reducing administrative burdens, helping to consolidate information, and collecting multimodal data, these AI tools also pose many challenges. Some thought leaders have even recommended pausing their further development and training, calling for appropriate governance and safety protocols to help mitigate the potential negative impacts of these tools on patient care.
However, the reality is that a nationwide moratorium on model production and development is unlikely and perhaps unnecessary. Embracing new technologies is critical, and so is a careful and prescriptive effort to lay appropriate guardrails. Time and again, we find ourselves adopting new technologies and innovations before we have paved the way for their safe use. With this reality in mind, we suggest a "cautious embrace." It's time we shift conversations away from the polarized ends of the spectrum and instead focus on meeting in the middle. We need an approach to LLM adoption in health care that acknowledges these tools' limitations and translates that understanding into actionable next steps.
Understanding the Limitations
An incomplete understanding of any invention, innovation, or technology will create angst and distrust, a dilemma only compounded by AI's "black-box" nature and the deep complexities of LLMs. To resolve this dilemma, we first need to understand what LLMs cannot do. Here's a partial list to consider:
- Some LLMs are trained on vast amounts of unverified, unfiltered, widely available data, a collection that includes facts, misinformation, and bias.
- The data sources used in model training for some LLMs may be unavailable. This means we do not know how much of the training data comprises peer-reviewed articles vs. other unfiltered and unverified sources.
- LLMs don't have a very good "memory": if a fact is true but uncommonly "known" in the data sets they scrape from the internet, it will likely be under-represented in LLM-generated predictions.
- LLMs (and AI models generally) cannot differentiate fact from fiction. They cannot make judgments on data quality or integrity; they cannot make ethical or moral judgments.
- Like many other AI models, LLMs are subject to temporal degradation: the quality of their predictions degrades over time as the training data becomes outdated with respect to new, evolving data.
- By their nature, AI algorithms are probabilistic. They learn patterns and make predictions based on complex statistical modeling, which is why LLM predictions may change. This implies inherent limitations in reliability and reproducibility: the predicted output may differ on subsequent queries of the same question.
Taken together, this means that models can learn from poor quality or harmful data, which in turn may translate into false, harmful predictions. This also means that biases in our data may find their way into the final output.
Even if LLMs learn from high-quality and factual data, models can still produce inconsistent results that change from moment to moment. They may even "hallucinate," as is well documented in the literature.
Compounding these problems, the models themselves cannot alert anyone when any of these failures occur.
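The probabilistic limitation described above is easy to demonstrate in miniature. The sketch below is a toy illustration, not a real LLM: the "next-token" probabilities are invented, and the temperature mechanism is a simplified stand-in for how generation actually works. It shows why the same prompt can yield different outputs on repeated queries.

```python
import random

# Toy next-token distribution (hypothetical probabilities, for illustration only).
# A real LLM produces a distribution like this over tens of thousands of tokens
# at every step of generation.
next_token_probs = {
    "aspirin": 0.55,
    "ibuprofen": 0.30,
    "acetaminophen": 0.15,
}

def sample_token(probs, temperature=1.0):
    """Sample one token; higher temperature flattens the distribution,
    lower temperature concentrates it on the most likely token."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    tokens, weights = zip(*((t, w / total) for t, w in scaled.items()))
    return random.choices(tokens, weights=weights, k=1)[0]

# Asking the "same question" twice can yield different answers:
first = sample_token(next_token_probs)
second = sample_token(next_token_probs)
# first and second may differ, even though the "model" and "prompt" are identical.
```

Even with identical inputs, each call draws a fresh sample, which is the root of the reproducibility concern: a clinician asking the same question twice has no guarantee of the same answer.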
Defining Use Cases
Given these limitations, some tasks are inappropriate for LLMs to undertake at this point in their evolution. Currently, LLMs may be poorly suited to tasks that directly dictate patient care and in situations where bias and inequity are already of concern.
Instead, it makes more sense to focus first on less-risky LLM use cases and ones where human oversight can more easily mitigate bias and harm. Similarly, we might consider tasks that augment the end-user in completing administrative responsibilities and avoid those that require clinical judgment or ethical thinking. Consider the following non-exhaustive list of possible early applications:
- Medical notes: LLMs may be extremely useful in generating the history of the present illness or suggesting an order entry. Assessments and recommendations may best be reserved for the clinician to avoid errors caused by LLM hallucinations or misinformation.
- Patient communications: With oversight from clinicians, LLMs may be able to answer patients' simple questions.
- Manuscripts: Such tools may assist authors and editors with grammar and review. Generating scientific content and citations, however, should be avoided.
- Education: LLMs can be trained on validated texts and literature and serve as a tool for students, trainees, and practicing clinicians to advance and refine their medical knowledge.
Addressing the Gaps
LLM shortcomings have solutions, but each problem requires a thorough and comprehensive approach. We first need to address data transparency and trustworthiness. Model developers need to be transparent about their data sources. They should catalog relevant metadata and make it available to model end-users. Source type, author or creator, geographic origin, and publication date - among many other pieces of information - should be listed, disseminated, and updated.
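As a concrete illustration, the source catalog described above could start as a structured record per data source. The schema below is a hypothetical sketch: the class and field names are ours, not an existing standard, and the sample entry is invented.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical provenance record for one training-data source.
# Field names are illustrative, not an established metadata standard.
@dataclass
class TrainingSourceRecord:
    source_type: str        # e.g. "peer-reviewed article", "clinical note", "web text"
    author_or_creator: str
    geographic_origin: str
    publication_date: date
    license: str = "unknown"

# A catalog is then just a list of records, disseminated to end-users.
catalog = [
    TrainingSourceRecord(
        source_type="peer-reviewed article",
        author_or_creator="Example Journal",   # invented example entry
        geographic_origin="US",
        publication_date=date(2022, 6, 1),
        license="CC-BY-4.0",
    ),
]

# One query an end-user might run: what fraction of sources are peer reviewed?
peer_reviewed = sum(r.source_type == "peer-reviewed article" for r in catalog)
fraction = peer_reviewed / len(catalog)
```

Even a minimal catalog like this would let end-users answer the question posed earlier: how much of the training data comes from peer-reviewed sources versus unfiltered ones.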
Data credibility is also a key issue to address. If we are to develop models used in the real world, we are responsible for using trustworthy data or, at minimum, understanding the degree of the data's credibility. Not all data is equal: some of it represents opinions, some of it facts.
How do we codify data according to trust, reliability, and fairness, and what tools do we build to help support this?
We must also ensure that data used to train models respects patient privacy and complies with regulatory requirements. These conversations may be difficult and potentially controversial, but we must have them.
Next is the issue of data quality. Data quality in health care is notoriously poor. While this often refers to consistency, accuracy, precision, and timeliness of structured data elements, LLMs will also draw on unstructured data elements that are highly inconsistent. Physician notes, for example, are littered with outdated or potentially inaccurate information. If we are to use physician notes in model training, we need to think carefully about the risks and benefits of using all data indiscriminately vs. a curated subset.
In addition, we need to develop technical and innovative tools that address the issues just discussed. Specifically, we need tools that enable us to audit predicted text from an LLM. Can we present the users with a degree of confidence or certainty about the results, alerting them of possible hallucinations? Such tools to provide transparency need to be developed in tandem with the models themselves. They must incorporate audit trails so end-users can see what kinds of data were used to inform predictions. These challenging tasks will likely require significant expertise and resources to develop.
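One minimal form such an auditing tool could take is surfacing the model's own per-token uncertainty to the user. The sketch below assumes access to per-token log-probabilities, which some model APIs expose; the specific values and the threshold here are invented for illustration.

```python
# Hypothetical per-token log-probabilities for a generated sentence.
# The values are invented; some model APIs expose numbers like these.
token_logprobs = {
    "The": -0.05,
    "recommended": -0.20,
    "dose": -0.10,
    "is": -0.08,
    "25": -2.90,
    "mg": -0.60,
}

def flag_low_confidence(logprobs, threshold=-1.5):
    """Return the tokens whose log-probability falls below the threshold,
    i.e. tokens the model was comparatively unsure about."""
    return [tok for tok, lp in logprobs.items() if lp < threshold]

suspect = flag_low_confidence(token_logprobs)
# Here the specific number "25" is flagged: the model was least certain about
# exactly the detail a clinician would most need to verify.
```

This is far from a full audit trail, but it illustrates the direction: rather than presenting fluent text with uniform authority, the interface can mark the spans most likely to be hallucinated.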
As with any other AI model, we will need to validate a model's output as useful, accurate, generalizable, and equitable. What tools specific to LLMs need to be developed to accomplish this?
Finally, many thought leaders are calling for rigorous development of standards, regulatory requirements, and ethical guardrails. We look toward organizations, such as CHAI (Coalition for Health AI), to provide guidance for model developers.
It is equally essential that model developers actively engage with AI ethicists and clinicians who understand bias and inequities. Model developers must ensure their training data sets comply with regulations and protect patient privacy. Healthy tensions should be actively sought out, not avoided. These interactions will lead to important discussions that will ultimately help ground these quickly developing technologies in sound, ethical foundations.
However, remember that successfully implementing LLMs into workflows does not stop here. We need to address many important tactical questions. One is intellectual property: LLMs can "memorize" training data, so what are the IP implications for the entities providing that data? We risk ostracizing data contributors if we cannot provide appropriate guardrails and incentive structures. We must devise business models that support model developers, data providers, and end-users alike.
Let's bring model developers, clinicians, ethicists, computational linguists, and policymakers to the table to have these complex conversations. Let's shift our dialogue toward action to develop the tools and frameworks that will ultimately empower LLMs to improve patient care ethically and equitably.
References

Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL. Nonhuman "Authors" and Implications for the Integrity of Scientific Publication and Medical Knowledge. JAMA. 2023;329(8):637-639. doi:10.1001/jama.2023.1344

Harrer S. Attention is not all you need: the complicated case of ethically using large language models in health care and medicine. EBioMedicine. 2023;90.

Mallen A, et al. When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories. arXiv preprint arXiv:2212.10511. 2022.

Merrill W, et al. Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand? Trans Assoc Comput Linguist. 2021;9:1047-1060.

van Dis EAM, et al. ChatGPT: five priorities for research. Nature. 2023;614(7947):224-226. doi:10.1038/d41586-023-00288-7