Lessons Learned from Misaligned Algorithms

By John Halamka • April 7, 2022

The potential benefits of AI-enabled algorithms have to be weighed against the risk of dataset shift, which can compromise their accuracy in ways that developers never anticipated.

By John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform

In a previous blog, we pointed out that AI-based algorithms rely on math, not magic. But when reliance on that math overshadows clinicians’ diagnostic experience and common sense, clinicians need to partner with their IT colleagues to reconcile artificial and human intelligence. A recent investigation spearheaded by STAT and MIT makes that recommendation painfully clear.

Our earlier post pointed out a problem in Epic Systems’ sepsis algorithm, which the EHR vendor said had a predictive rating of 0.76 to 0.83. The rating is derived from a metric called area under the curve (AUC): a rating of 1.0 means the algorithm perfectly separates patients who go on to develop the complication from those who do not, while a rating of 0.5 is no better than chance. A previous STAT analysis revealed that the algorithm’s predictive value had dropped significantly over time, the result of unanticipated changes in patient characteristics brought on by shifting patient demographics during the COVID-19 pandemic. In the latest STAT/MIT experiment, investigators broadened the scope of their inquiry to determine whether AI algorithms in general are vulnerable to these types of problems, which fall into the broad category of dataset shift.
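To make the AUC figures above concrete, here is a minimal sketch in Python. It assumes binary outcome labels (1 = developed the complication, 0 = did not) and one numeric risk score per patient. The function name `auc` and the sample numbers are hypothetical illustrations, not anything taken from Epic's or STAT's analyses. The sketch computes AUC directly from its definition: the probability that a randomly chosen patient who developed the complication received a higher score than a randomly chosen patient who did not.

```python
def auc(labels, scores):
    """Share of positive/negative patient pairs that the scores rank correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: four patients, two of whom developed the complication.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ranked correctly
```

The pairwise loop is chosen for readability, not speed. On real patient volumes a rank-based calculation or an established statistics library would be the more practical choice.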

Casey Ross, with STAT, explained that the team created three algorithms in the hope that they might predict three things: how long a patient would remain hospitalized, how likely they are to die, and their risk of developing sepsis. Because there is strong evidence that many algorithms gradually become less accurate, the researchers evaluated these three digital tools at three-year intervals to look for dataset shift. The algorithms were tested on 40,000 patients in ICUs at Beth Israel Deaconess Medical Center (BIDMC). The team reported: “The algorithm had been trained to spot the warning signs of sepsis on ICU data between 2008 and 2010. At first, it was highly adept, registering an AUC of 0.73. The first signs of distress arose a few years later when the model started to predict sepsis in patients who didn’t go on to develop it, while overlooking those who became seriously ill. By the end of 2015, its AUC had dipped below 0.60.”

The deterioration in performance was caused by changes in some of the algorithm’s input variables over time. Specifically, BIDMC had updated its International Classification of Diseases (ICD) coding system, a collection of codes used to describe a patient’s condition in the medical record. The medical center changed from ICD-9 to ICD-10 in 2015, which added several new codes, including those affecting the detection and tracking of sepsis-related risk factors.

Unfortunately, the change in ICD codes wasn’t the only problem that compromised the algorithms. BIDMC also acquired several hospitals during the experimental period, including Lahey Health in 2019. The resulting influx of suburban patients into the dataset had a major impact, once again compromising the algorithm’s accuracy.
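One simple way to notice a change such as the ICD-9 to ICD-10 switch described above is to track how many incoming codes the model never saw during training. The Python sketch below is only an illustration of that idea, not the method the STAT/MIT team used. The function names, the 5 percent threshold, and the sample code strings are all assumptions chosen for the example, and the codes are not the study's actual inputs.

```python
def unseen_code_rate(training_codes, incoming_codes):
    """Fraction of incoming diagnosis codes that never appeared in training data."""
    known = set(training_codes)
    if not incoming_codes:
        return 0.0
    unseen = sum(1 for code in incoming_codes if code not in known)
    return unseen / len(incoming_codes)


def needs_review(training_codes, incoming_codes, max_unseen_rate=0.05):
    """Flag a batch when unseen codes exceed a hypothetical tolerance."""
    return unseen_code_rate(training_codes, incoming_codes) > max_unseen_rate


# Illustrative code strings only: ICD-9 style for training, mixed styles afterwards.
trained_on = ["038.9", "995.91", "486"]
before_switch = ["038.9", "486", "995.91", "486"]
after_switch = ["A41.9", "J18.9", "R65.20", "486"]

print(unseen_code_rate(trained_on, before_switch))  # every code was seen in training
print(unseen_code_rate(trained_on, after_switch))   # three of four codes are new
print(needs_review(trained_on, after_switch))
```

A check like this only catches shifts that show up as new category values. A change in patient mix, such as the one that followed the hospital acquisitions, would require comparing the distributions of inputs or monitoring accuracy directly.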

In light of such potential confounders, Nigam Shah, M.B.B.S., Ph.D., professor of medicine and biomedical data science at Stanford University School of Medicine and chief data scientist for Stanford Health Care, suggested that such predictive tools be put on a schedule so that they can be regularly updated: “Ideally alerts would say something like - our AUC degrades by 0.26 per year, and we recommend re-training every eight months.”
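The kind of alert Dr. Shah describes implies a small piece of arithmetic: estimate how fast AUC is falling, decide how much loss is tolerable, and derive a retraining interval from the two. The Python sketch below shows one way to do that, assuming the decline is roughly linear, which real models may not obey. The function names, the measurement history, and the tolerated drop are hypothetical values chosen for illustration.

```python
def degradation_per_year(points):
    """Least-squares slope of AUC against time, returned as a positive yearly loss.

    `points` is a list of (years_since_deployment, auc) measurements.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two (year, auc) measurements")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("measurements must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


def months_between_retraining(loss_per_year, tolerated_drop):
    """Months until AUC has fallen by `tolerated_drop`, assuming a linear decline."""
    if loss_per_year <= 0:
        raise ValueError("no measurable degradation, so no interval can be derived")
    return 12 * tolerated_drop / loss_per_year


# Hypothetical monitoring history: AUC measured once a year after deployment.
history = [(0, 0.80), (1, 0.75), (2, 0.70)]
loss = degradation_per_year(history)
print(round(loss, 3))
print(round(months_between_retraining(loss, tolerated_drop=0.025), 1))
```

Under the same linear assumption, the illustrative figures in the quote (a loss of 0.26 per year and retraining every eight months) would correspond to tolerating a drop of roughly 0.17 in AUC before retraining.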

For nearly a year, Dr. Shah and colleagues from academia, industry and government have convened to discuss ways to overcome this and other challenges in clinical AI. We recently launched a Coalition for Health AI to help develop the standards, guardrails and guidance needed to enhance the reliability of clinical AI tools.

We will keep providing progress reports on this blog.
