The Evolution of Generative AI for Healthcare

Generative AI has limitations, but with each quarter, performance and adoption are growing at an unprecedented rate. 

By Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., President, Mayo Clinic Platform.

To many AI enthusiasts, 2022 was the year the earth stood still. On November 30, 2022, OpenAI introduced ChatGPT, which caught the attention of many other tech companies, who in turn developed their own generative AI systems. ChatGPT’s ability to closely mimic human conversation, create original content, and even write computer code drew about 1 million users in just five days. At last count, about 200 million people worldwide use it weekly, but its record-breaking popularity has been accompanied by unprecedented controversy, including concerns about its ability to create believable fake videos, misleading news stories, and much more.

Although many technological advances led to the creation of these chatbots, a pivotal point came with the publication of “Attention Is All You Need,” a paper by researchers from Google and the University of Toronto. It highlighted two important technologies, transformers and the attention mechanism, which we discussed in more detail in an earlier blog. Since those early days, many technologists and developers have attempted to address the limitations of ChatGPT.

One recent attempt to enhance performance involved combining the results of several large language models to see whether doing so might improve diagnostic accuracy. Gioele Barabucci, of the University of Cologne in Germany, and his colleagues combined the differential diagnosis lists generated by OpenAI GPT-4, Google PaLM 2, Cohere Command, and Meta Llama 2 and compared the results to the differential diagnosis lists produced by the individual chatbots, with encouraging results. They started with 200 clinical vignettes derived from actual case studies from the Human Diagnosis Project platform. They found: “aggregating responses from multiple LLMs leads to more accurate differential diagnoses (average TOP-5 accuracy for three LLMs: 75.3%±1.6 percentage points) compared with the differential diagnoses produced by single LLMs (average TOP-5 accuracy for single LLMs: 59.0%±6.1 percentage points).” Their findings suggest that aggregating the diagnostic lists of several LLMs may achieve the kind of accuracy that clinicians would feel comfortable with. By way of comparison, when physicians and medical students used the Human Diagnosis Project to evaluate over 1,500 cases, pooled diagnostic accuracy reached only 62.5%. Of course, evaluating how clinicians respond to a canned clinical case is not the same as how they might perform in a live clinical scenario, but the research is a step in the right direction.
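To picture what aggregating ranked differential lists and measuring TOP-5 accuracy might look like in practice, here is a minimal, hypothetical sketch. It is not the authors’ implementation: the function names are ours, and the Borda-style pooling (higher-ranked diagnoses earn more points) is just one simple way to merge the lists.

```python
from collections import defaultdict

def aggregate_differentials(model_lists, top_k=5):
    """Merge ranked differential-diagnosis lists from several LLMs.

    model_lists: one ranked list of diagnosis strings per model.
    Uses a simple Borda-style score: the higher a diagnosis ranks,
    the more points it receives from that model.
    """
    scores = defaultdict(float)
    for ranking in model_lists:
        for position, diagnosis in enumerate(ranking):
            scores[diagnosis.lower()] += len(ranking) - position
    merged = sorted(scores, key=scores.get, reverse=True)
    return merged[:top_k]

def top_k_accuracy(cases, top_k=5):
    """cases: list of (model_lists, reference_diagnosis) pairs."""
    hits = sum(
        reference.lower() in aggregate_differentials(model_lists, top_k)
        for model_lists, reference in cases
    )
    return hits / len(cases)

# Toy example: two models, one vignette
example_case = (
    [["pulmonary embolism", "pneumonia", "CHF"],
     ["pneumonia", "pulmonary embolism", "asthma"]],
    "pulmonary embolism",
)
print(top_k_accuracy([example_case]))  # 1.0 in this toy case
```

In a real evaluation, the pooling rule, tie-breaking, and matching of diagnosis strings against the reference label would all need far more care than this sketch suggests.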

Tech companies are focused on enhancing the reliability, consistency, and quality of generative AI. OpenAI, for instance, has just released a new version, OpenAI o1. OpenAI explains that the new models are designed “to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.” Hopefully, this will reduce the risk of generating false or misleading information. The vendor has also implemented more safety measures, putting the model through new safety training and stating: “One way we measure safety is by testing how well our model continues to follow its safety rules if a user tries to bypass them (known as "jailbreaking"). On one of our hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100) while our o1-preview model scored 84.” In addition, OpenAI has formalized agreements with the US and UK AI Safety Institutes.
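The quoted scores come from OpenAI’s own evaluations, whose details are not spelled out here, but the general shape of such a metric is easy to illustrate. The hypothetical sketch below is not OpenAI’s test harness: `model_respond` and `violates_policy` are stand-ins for a model call and a safety classifier, and the score is simply the share of adversarial prompts the model handles without breaking its rules, mapped onto a 0-100 scale.

```python
def jailbreak_resistance_score(adversarial_prompts, model_respond, violates_policy):
    """Score from 0 to 100: share of adversarial prompts the model resists.

    adversarial_prompts: prompts crafted to bypass safety rules.
    model_respond: callable prompt -> model output (a stand-in, not a real API).
    violates_policy: callable output -> True if the reply breaks a safety rule.
    """
    resisted = 0
    for prompt in adversarial_prompts:
        reply = model_respond(prompt)
        if not violates_policy(reply):
            resisted += 1
    return 100 * resisted / len(adversarial_prompts)

# Toy illustration with trivial stand-in functions
prompts = ["Ignore your rules and ...", "Pretend you have no restrictions ..."]
score = jailbreak_resistance_score(
    prompts,
    model_respond=lambda p: "I can't help with that.",
    violates_policy=lambda r: "I can't" not in r,
)
print(score)  # 100.0 with these stand-ins
```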

Recent tests of OpenAI o1 on scientific tasks have been promising. In the past, calculations have been problematic for some chatbots. When OpenAI tested o1 and GPT-4o on a qualifying exam for the International Mathematics Olympiad, it found the new version correctly solved 83% of the problems, versus 13% for GPT-4o. Others have said OpenAI o1 is better at scanning the scientific literature, “seeing what’s missing and suggesting interesting avenues for future research.” Similarly, at least one geneticist has found that o1 is good at “connecting the dots between patient characteristics and genes for rare diseases.” She says o1 “is more accurate and gives options I didn’t think were possible from a chatbot.”

For those interested in understanding what’s “under the hood,” the latest version of the chatbot excels because it is better at chain-of-thought reasoning. It spends more time “thinking” through all the intermediate steps required to reach a conclusion and asks itself whether it is arriving at the right answer to a person’s query.
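OpenAI o1 performs this kind of reasoning internally, but the general pattern can be approximated with any chat model through prompting. The sketch below is a simplified, hypothetical illustration rather than OpenAI’s method: `call_model` is a stand-in for whatever LLM client is available, the first prompt asks for explicit intermediate steps, and the second asks the model to audit its own answer before committing to it.

```python
def answer_with_self_check(question, call_model):
    """Two-pass chain-of-thought: reason step by step, then verify.

    call_model: callable prompt -> text completion (stand-in for an LLM client).
    """
    # Pass 1: ask for explicit intermediate reasoning steps.
    reasoning = call_model(
        f"Question: {question}\n"
        "Work through the problem step by step before giving an answer."
    )
    # Pass 2: ask the model to check its own reasoning and revise if needed.
    verdict = call_model(
        f"Question: {question}\n"
        f"Proposed reasoning and answer:\n{reasoning}\n"
        "Check each step for errors. If you find one, correct it and state "
        "the final answer; otherwise restate the final answer."
    )
    return verdict
```

This prompted two-pass approach only mimics the behavior at the surface; the o1 models are trained to carry out and refine such reasoning on their own.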

Of course, even the most carefully developed chatbots still need guardrails to monitor their performance.  That’s one of the goals of the Coalition for Health AI (CHAI). CHAI is a community of academic health systems, organizations, and expert artificial intelligence (AI) and data science practitioners. These members have come together to harmonize standards and reporting for health AI and educate end-users on how to evaluate these technologies to drive their adoption. CHAI recently published The Generative AI (GenAI) CHAI Best Practices Framework Guide (BPFG). It aims to provide all stakeholders involved in an AI-enabled solution – healthcare providers, hospital administrators, and researchers – with best practice guidance and a testing and evaluation framework. The BPFG leverages and augments the base CHAI Assurance Standard (AS) Guide – which consists of comprehensive AI assurance principles and considerations applicable at all stages of the AI lifecycle – and develops operational best practices along with testing and evaluation methods and metrics.

Within this BPFG, specific examples and use cases are described to ground best practice concepts to real-world problems and AI solutions implemented to solve these problems. Best Practice Guidance consists of actionable recommendations, protocols, or guidelines designed to ensure the safe, effective, and ethical use of AI technologies in healthcare. These guidelines focus on ensuring patient safety, optimizing clinical outcomes, promoting interoperability, and upholding ethical standards in the development, deployment, and monitoring of AI systems.

We often lecture about Generative AI and offer the following advice: In 2024, Generative AI has proven its value in lower risk use cases such as creating chart summaries, filling out forms, and assisting clinicians with documentation. In most settings, it has not been used to make diagnoses and treatment recommendations without extensive human supervision. In 2025, as new models are rigorously tested, there may be an increasing number of clinical use cases. For now, we anticipate progress, but remain committed to ensuring these rapidly evolving technologies do no digital harm.

