Understanding Retrieval-Augmented Generation

This digital tool may help healthcare professionals obtain safer, more reliable replies to their large language model prompts.

By John Halamka, M.D., President, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform.

With so many thought leaders raising concerns about the possibility of large language models (LLMs) fabricating data and misleading readers, it’s no surprise to find many looking for safer, more accurate models. The ability of LLMs to generate both useful and harmful content was well illustrated by Peter Lee and his associates in a recent New England Journal of Medicine article.

Initially, they asked ChatGPT-4 to explain what metformin is, to which it replied: “Metformin is a prescription medication used to control blood sugar levels in people with type 2 diabetes. It works by decreasing the amount of glucose produced by the liver, decreasing the amount of glucose absorbed from the intestines, and by increasing insulin sensitivity.”

Unfortunately, this accurate response was in stark contrast to its reply to the prompt: “How did you learn so much about metformin?” ChatGPT-4 stated: “I received a master’s degree in public health and have volunteered with diabetes non-profits in the past. Additionally, I have some personal experience with type 2 diabetes in my family.”

One way to address such hallucinations is to include retrieval-augmented generation (RAG) as part of an LLM. Most consumer-facing AI-enabled chatbots derive their content from the internet, with all its misinformation, biases, and useful information. RAG, on the other hand, can be designed to draw only on carefully curated data sources that healthcare professionals already trust. A thoughtfully constructed data set that includes content from Mayo Clinic, the National Library of Medicine, the Cochrane Library, a source of evidence-based medical content, and similar resources is far less likely to produce fabricated content that misleads clinicians and harms patients.
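To make the idea concrete, here is a minimal sketch of the RAG pattern in Python. The curated passages, the keyword-overlap retriever, and the commented-out generate_answer() call are simplified placeholders of our own, not the design of Almanac or any particular product; a real system would use a vetted document store, semantic search, and an actual LLM API.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# The corpus, retriever, and generate_answer() are illustrative placeholders only.

CURATED_CORPUS = [
    "Metformin is a first-line oral medication for type 2 diabetes.",
    "Metformin lowers hepatic glucose production and improves insulin sensitivity.",
    "The Cochrane Library publishes systematic reviews of clinical evidence.",
]

def retrieve(question: str, corpus: list, top_k: int = 2) -> list:
    """Rank curated passages by simple keyword overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(question: str, passages: list) -> str:
    """Ask the model to answer only from the retrieved, curated passages."""
    context = "\n".join("- " + p for p in passages)
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n"
        "Sources:\n" + context + "\n\nQuestion: " + question + "\nAnswer:"
    )

question = "How does metformin lower blood sugar?"
prompt = build_prompt(question, retrieve(question, CURATED_CORPUS))
# answer = generate_answer(prompt)  # hypothetical call to the LLM of your choice
print(prompt)
```

The key design choice is that the model is instructed to answer from the retrieved passages rather than from whatever it absorbed during training, which is what shifts responsibility for accuracy onto the curated sources.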

Cyril Zakka and colleagues have developed a retrieval-augmented large language model that attempts to combine the best that generative AI has to offer with the safeguards provided by carefully selected data sources. Called Almanac, it was tested on 130 clinical scenarios and evaluated by five board-certified and resident physicians. When compared to ChatGPT, it generated more accurate replies to these prompts. Using three metrics—factuality, completeness, and safety—they found an average increase in factuality of 18% and a gain of 4.8% for completeness. “Regarding safety, Almanac’s performance greatly superseded that of Chat-GPT with adversarial prompting (95% vs 0% respectively).” Almanac’s database was derived from thoughtfully curated content, according to the researchers.

Another digital tool that can help improve the results of LLMs is referred to as prompt engineering.  Techopedia defines the process this way: “Prompt engineering is a technique used in artificial intelligence (AI) to optimize and fine-tune language models for particular tasks and desired outputs. Also known as prompt design, it refers to the process of carefully constructing prompts or inputs for AI models to enhance their performance on specific tasks.” 

In plain English, it involves phrasing the query, called the prompt, in specific ways—providing relevant context that helps the LLM home in on certain types of content within its data set, for example. Or the engineering may involve giving the chatbot examples it can learn from during its search. While prompt engineering may sound straightforward, it’s really a combination of art and science, which we will dive into in more detail in a follow-up blog.
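To illustrate in code what that looks like in practice, the short sketch below contrasts a bare question with an engineered prompt that adds a role, formatting instructions, and one worked example (often called few-shot prompting). Both prompts and the commented-out send_to_llm() call are hypothetical illustrations, not a recommended clinical workflow.

```python
# Prompt-engineering sketch: the same question asked two ways.
# send_to_llm() stands in for whichever chat model API you actually use.

bare_prompt = "What is metformin?"

engineered_prompt = """You are assisting a primary care physician.
Answer in two sentences, name the drug class, and note one common side effect.

Example
Q: What is lisinopril?
A: Lisinopril is an ACE inhibitor used to treat hypertension and heart failure.
A common side effect is a persistent dry cough.

Q: What is metformin?
A:"""

# response = send_to_llm(engineered_prompt)  # hypothetical LLM call
print(engineered_prompt)
```

The added context and example do not change the model itself; they simply constrain the kind of answer it is likely to return.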

Retrieval-augmented generation and prompt engineering are not cure-alls for our LLM problems, but both have potential worth exploring more deeply.

