Will Retrieval-Augmented Large Language Models “Save the Day”?

Many clinicians are well aware of the shortcomings of large language models like ChatGPT. Several studies suggest that retrieval-augmented generation, by tapping into more reliable data sets, can help address these problems.

By Paul Cerrato, MA, Senior Research Analyst and Communications Specialist, Mayo Clinic Platform; Teresa Atkinson, MA, Education Operations Manager and Instructor in Healthcare Administration, Mayo Clinic Platform; and John Halamka, MD, Diercks President, Mayo Clinic Platform.

As nurses and physicians in clinical practice struggle to keep up with the demands of patient care, many are turning to chatbots powered by large language models (LLMs) to lighten their load. But doubts linger in the minds of many clinicians about the trustworthiness of these digital tools. We’ve all heard the troubling accounts of hallucinating chatbots and worry that they may put patients’ health at risk. Several thought leaders and researchers have been attempting to address these concerns. Among the possible solutions is retrieval-augmented generation (RAG).

As explained in an earlier article, RAG systems draw on carefully curated content that has been vetted by healthcare experts, which reduces the likelihood that an LLM will generate fabricated or inaccurate answers. Chatbots like ChatGPT, on the other hand, derive their content from the internet at large, with all the facts and fallacies it contains. The research comparing RAG-enhanced LLMs to the more generic models currently in use is promising.
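In practice, the technique works in two steps: relevant passages are first retrieved from the curated knowledge base, then supplied to the model as context for its answer. The Python sketch below illustrates the idea only; it is not any of the systems described in this article. The sample passages, the `query_llm` stand-in, and the simple word-overlap scoring are all assumptions made for illustration; production systems typically retrieve with embedding-based vector search over a much larger vetted corpus.

```python
# A minimal RAG sketch (illustrative only). The passages, the scoring,
# and query_llm are hypothetical stand-ins, not any system cited above.

CURATED_PASSAGES = [
    "KDIGO 2023: chronic kidney disease is classified by cause, GFR "
    "category, and albuminuria category.",
    "KDIGO 2023: refer to nephrology when eGFR falls below "
    "30 mL/min/1.73 m2.",
    "Triage guidance: chest pain with diaphoresis warrants the highest "
    "acuity level.",
]

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank vetted passages by words shared with the question -- a crude
    stand-in for the embedding-based vector search used in practice."""
    q_words = set(question.lower().split())
    return sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )[:k]

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call; it echoes the
    grounded prompt so the sketch runs end to end."""
    return f"[model response grounded in]\n{prompt}"

def answer_with_rag(question: str) -> str:
    """Assemble retrieved context into the prompt before calling the model."""
    context = "\n".join(retrieve(question, CURATED_PASSAGES))
    prompt = (
        "Answer using ONLY the vetted context below. If the context is "
        "insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return query_llm(prompt)

print(answer_with_rag("When should a patient be referred to nephrology?"))
```

The design point is that the model’s answer is constrained by what the retrieval step supplies, so the quality of the curated corpus, rather than the breadth of the open internet, determines what the chatbot can say.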

A recent study evaluated whether a retrieval-augmented LLM could improve an AI-driven algorithm’s ability to answer questions related to ophthalmology. The new LLM, called ChatZOC, drew on a data set containing more than 30,000 pieces of ophthalmic knowledge. It was compared to 10 existing LLMs, including GPT-4 and GPT-3.5. As Luo et al explained: “The evaluation, involving a panel of medical experts and biomedical researchers, focused on accuracy, utility, and safety. A double-masked approach was used to try to minimize bias assessment across all models.” When the investigators compared ChatZOC to a baseline model called Baichuan-2, they found that their chatbot aligned with scientific consensus 84% of the time, vs 46.5% for Baichuan-2. However, ChatZOC offered no significant improvement over GPT-4 (84% vs 79%, P = .06).

Another study demonstrated the potential benefits of integrating RAG into large language models in nephrology. RAG enhanced output accuracy by grounding responses in external data, making the models better suited for clinical decision-making. Miao et al explained: “A specialized ChatGPT model integrated with a RAG system, aligned with KDIGO 2023 guidelines for chronic kidney disease, demonstrates the ability to provide specialized, accurate medical advice and improve nephrology practices.”

An additional study explored the potential of integrating RAG with LLMs such as OpenAI’s GPT models to standardize emergency medical triage and reduce the variability caused by differences in personnel experience and training. Using 100 simulated scenarios based on the Japanese National Examination for Emergency Medical Technicians, the RAG-enhanced GPT-3.5 model achieved a correct triage rate of 70%, significantly outperforming EMTs and emergency physicians. The RAG-enhanced model also reduced under-triage rates to 8%, compared with higher rates in models without RAG. As Yazaki et al explain: “These results suggest that RAG-enhanced LLMs could improve the accuracy and consistency of emergency triage. Though further validation in diverse settings is required, the results of the study provide a foundation for the development of advanced emergency medical support systems that incorporate LLMs.”

Although all three of these studies advance the case for using RAG to improve the accuracy of LLMs, they nonetheless have shortcomings that clinicians need to be aware of. A closer look at the research on ED triage, for instance, shows that the cases evaluated were not actual emergency medical cases but simulations derived from the national examination for EMTs. Exam questions rarely capture the complexities of everyday ED care. In addition, the cases used structured inputs to make triage decisions; in the real world, emergency care is much “messier” and often involves scribbled narrative notes. The investigators also point out that their RAG-enhanced model was benchmarked against EMTs and emergency department physicians, not specialists.

These limitations, along with several others outlined in the research literature, demonstrate that while retrieval-augmented generation is promising, it certainly will not “save the day.” The technology needs to be further refined and tested in more real-world settings.

