Learning from AI’s Failures
A detailed picture of AI’s mistakes is the canvas upon which we create better digital solutions.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
We all tend to ignore clichés because we’ve heard them so often, but some clichés are worth repeating. “We learn more from failure than success” comes to mind. While it may be overused, it nonetheless conveys an important truth for anyone involved in digital health. Two types of failures are worth closer scrutiny: algorithms that claim to improve diagnosis or treatment but fall short for lack of evidence or fairness; and failure to convince clinicians in community practice that evidence-based algorithms are worth using.
As we mentioned in an earlier column, a growing number of thought leaders in medicine have criticized the rush to generate AI-based algorithms because many lack the solid scientific foundation required to justify their use in direct patient care. Among the criticisms being leveled at AI developers are concerns about algorithms derived from a dataset that is not validated with a second, external dataset, overreliance on retrospective analysis, lack of generalizability, and various types of bias. A critical look at the hundreds of healthcare-related digital tools that are now coming to market indicates the need for more scrutiny, and the creation of a set of standards to help clinicians and other decision makers separate useful tools from junk science.
The digital health marketplace is crowded with attention-getting tools. Among 59 FDA-approved medical devices that incorporated some form of machine learning, 49 unique devices were designed to improve clinical decision support, most of which are intended to assist with diagnosis or triage. Some were designed to automatically detect diabetic retinopathy, analyze specific heart sounds, measure ejection fraction and left ventricular volume, and quantify lung nodules and liver lesions, to name just a few. Unfortunately, the evidential support for many recently approved medical devices varies widely.
Among the AI-based algorithms that have attracted attention is one designed to help clinicians predict the onset of sepsis. The Epic Sepsis Model (ESM) has been used on tens of thousands of inpatients to gauge their risk of developing this life-threatening complication. Part of the Epic EHR system, it is a penalized logistic regression model that the vendor has tested on over 400,000 patients in 3 health systems. Unfortunately, because ESM is a proprietary algorithm, there’s a paucity of information available on the software’s inner workings or its long-term performance. Investigators from the University of Michigan conducted a detailed analysis of the tool among over 27,600 patients and found it wanting. Andrew Wong and his associates found an area under the receiver operating characteristic curve (AUROC) of only 0.63. Their report states: “The ESM identified 183 of 2552 patients with sepsis (7%) who did not receive timely administration of antibiotics, highlighting the low sensitivity of the ESM in comparison with contemporary clinical practice. The ESM also did not identify 1709 patients with sepsis (67%) despite generating alerts for an ESM score of 6 or higher for 6971 of all 38,455 hospitalized patients (18%), thus creating a large burden of alert fatigue.” They go on to discuss the far-reaching implications of their investigation: “The increase and growth in deployment of proprietary models has led to an underbelly of confidential, non–peer-reviewed model performance documents that may not accurately reflect real-world model performance. Owing to the ease of integration within the EHR and loose federal regulations, hundreds of US hospitals have begun using these algorithms.”
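To make the quoted figures concrete, the short sketch below recomputes them from the counts in the Michigan report. It is only an illustration: the helper names are ours, the last ratio assumes that every detected sepsis case is one of the alerted patients, and the small AUROC example uses made-up toy scores, not ESM output.

```python
# Counts copied from the quoted Michigan report; helper names are illustrative.
SEPSIS_CASES = 2552
MISSED_BY_ESM = 1709
FLAGGED_WITHOUT_TIMELY_ANTIBIOTICS = 183
ALERTS = 6971
HOSPITALIZED = 38455


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def auroc(pos_scores, neg_scores):
    """Chance that a random positive case outscores a random negative one (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Sepsis cases that did trigger an alert at the score threshold of 6.
detected = SEPSIS_CASES - MISSED_BY_ESM

report = {
    "missed_pct": pct(MISSED_BY_ESM, SEPSIS_CASES),
    "sensitivity_pct": pct(detected, SEPSIS_CASES),
    "flagged_without_timely_antibiotics_pct": pct(
        FLAGGED_WITHOUT_TIMELY_ANTIBIOTICS, SEPSIS_CASES
    ),
    "alert_rate_pct": pct(ALERTS, HOSPITALIZED),
    # Assumption: each detected sepsis case is one of the alerted patients.
    "alerts_that_were_sepsis_pct": pct(detected, ALERTS),
}

# Hypothetical toy risk scores, only to show how an AUROC is computed.
toy_auroc = auroc([0.9, 0.4], [0.5, 0.2, 0.1])

for name, value in report.items():
    print(f"{name}: {value}")
print(f"toy_auroc: {toy_auroc:.2f}")
```

The first, third, and fourth values reproduce the 67%, 7%, and 18% in the quotation, and the second shows that the model caught roughly one in three sepsis cases. For context, an AUROC of 0.5 is no better than chance and 1.0 is perfect discrimination.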
Reports like this only serve to amplify the reservations many clinicians have about trusting AI-based clinical decision support tools. Unfortunately, they tend to make clinicians not just skeptical but cynical about all AI-based tools, which is a missed opportunity to improve patient care. As we pointed out in a recent NEJM Catalyst review, there are several algorithms that are supported by prospective studies, including a growing number of randomized controlled trials.
So how do we get scientifically well-documented digital health tools into clinicians’ hands and convince them to use them? One approach is to develop an evaluation system that impartially reviews all the specs for each product and generates model cards that give end users a quick snapshot of its strengths and weaknesses. But that’s only the first step. By way of analogy, consider the success of online stores hosted by Walmart or Amazon. They’ve invested heavily in state-of-the-art supply chains that ensure their products are available from warehouses as customers demand them. But without a delivery service that gets products into customers’ homes quickly and with a minimum of disruption, even the best products will sit on warehouse shelves. The delivery service has to integrate seamlessly into customers’ lives. The product has to show up on time, it has to be the right size garment, in a sturdy box, and so on. Similarly, the best diagnostic and predictive algorithms have to be delivered with careful forethought and insight, which requires design thinking, process improvement, workflow integration, and implementation science.
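As a rough illustration of the model card idea mentioned above, a card could be as simple as a small structured record that renders a one-glance summary. The field names below are hypothetical and not drawn from any published model card standard. The example values are taken from the sepsis model discussion earlier in this article, apart from the intended-use wording and limitations, which paraphrase it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelCard:
    """Hypothetical snapshot of an algorithm's evidence; fields are illustrative."""
    name: str
    model_type: str
    intended_use: str
    development_population: str
    external_validation_auroc: Optional[float] = None
    prospective_evidence: bool = False
    known_limitations: List[str] = field(default_factory=list)

    def snapshot(self) -> str:
        """Render the card as a short plain-text summary for end users."""
        auroc = (
            f"{self.external_validation_auroc:.2f}"
            if self.external_validation_auroc is not None
            else "not reported"
        )
        lines = [
            f"Model: {self.name} ({self.model_type})",
            f"Intended use: {self.intended_use}",
            f"Developed on: {self.development_population}",
            f"External validation AUROC: {auroc}",
            f"Prospective evidence: {'yes' if self.prospective_evidence else 'no'}",
        ]
        lines += [f"Limitation: {item}" for item in self.known_limitations]
        return "\n".join(lines)


# Example values drawn from the sepsis model discussion in this article.
card = ModelCard(
    name="Epic Sepsis Model",
    model_type="penalized logistic regression",
    intended_use="predict onset of sepsis in inpatients",
    development_population="over 400,000 patients in 3 health systems (vendor testing)",
    external_validation_auroc=0.63,
    known_limitations=[
        "proprietary; little public information on inner workings",
        "alerts on 18% of hospitalized patients in one external study",
    ],
)
print(card.snapshot())
```

A real evaluation system would need far more detail, such as subgroup performance and bias assessments, but even a sketch like this shows how strengths and weaknesses can sit side by side in one view.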
Ron Li and his colleagues at Stanford University describe this delivery service in detail, emphasizing the need to engage stakeholders from all related disciplines before even starting algorithm development to look for potential barriers to implementation. They also suggest the need for “empathy mapping” to look for potential power inequities among clinician groups who may be required to use these digital tools. It is easy to forget that implementing any technological innovation must also take into account the social and cultural issues unique to the healthcare ecosystem, and to the individual facility where it is being implemented.
If we are to learn from AI’s failures, we need to evaluate its products and services more carefully and develop them within an interdisciplinary environment that respects all stakeholders.