Dataset shift can thwart the best intentions of algorithm developers and tech-savvy clinicians, but there are solutions.
By John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform.
Generalizability has always been a concern in health care, whether we’re discussing the application of clinical trials or machine learning-based algorithms. A large randomized controlled trial that finds an intensive lifestyle program doesn’t reduce the risk of cardiovascular complications in patients with Type 2 diabetes, for instance, suggests the diet/exercise regimen is not worth recommending to patients. But the question immediately comes to mind: Can that finding be generalized to the entire population of Type 2 patients? As we have pointed out in other publications, subgroup analysis has demonstrated that many patients do, in fact, benefit from such a program.
The same problem exists in health care IT. Several algorithms have been developed to help classify diagnostic images, predict disease complications, and more. A closer look at the datasets upon which these digital tools are based indicates many suffer from dataset shift. In plain English, dataset shift is what happens when the data collected during the development of an algorithm changes over time and no longer matches the data the algorithm encounters once it is put into use. For example, the patient demographics used to create a model may no longer represent the patient population when the algorithm enters clinical practice. This happened when COVID-19 changed the demographic characteristics of patients, rendering the Epic sepsis prediction tool ineffective.
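To make the idea concrete, here is a minimal sketch of how a team might flag this kind of demographic drift: compare a feature’s distribution in the training cohort against what the deployed model is actually seeing. The feature (patient age), cohorts, and tolerance threshold are all hypothetical, chosen only to illustrate the check.

```python
# Illustrative sketch: flag covariate shift by comparing a feature's
# distribution at training time vs. deployment time.
# The feature, cohorts, and threshold below are hypothetical.
import random
from statistics import mean

def population_shifted(train_values, live_values, tolerance=0.15):
    """Flag a shift when the live mean drifts more than `tolerance`
    (as a fraction of the training mean) away from the training mean."""
    train_mean = mean(train_values)
    live_mean = mean(live_values)
    return abs(live_mean - train_mean) / abs(train_mean) > tolerance

random.seed(0)
# Ages in the training cohort vs. a post-pandemic deployment cohort.
train_ages = [random.gauss(62, 8) for _ in range(1000)]
live_ages = [random.gauss(48, 12) for _ in range(1000)]  # younger cohort

print(population_shifted(train_ages, train_ages))  # False: same data
print(population_shifted(train_ages, live_ages))   # True: demographics moved
```

A production system would use a proper two-sample statistical test over many features, but even a simple mean-drift check like this can surface the problem before clinicians notice degraded predictions.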
Samuel Finlayson, PhD, with Harvard Medical School, and his colleagues described a long list of dataset shift scenarios that can compromise the accuracy and equity of AI-based algorithms, which in turn can compromise patient outcomes and patient safety. Finlayson et al. list 14 scenarios, which fall into three broad categories: changes in technology, changes in population and setting, and changes in behavior. Examples of ways in which dataset shift can create misleading outputs that send clinicians down the wrong road include:
- Changes in the X-ray scanner models used
- Changes in the way diagnostic codes are collected (e.g., switching from ICD-9 to ICD-10)
- Changes in patient population resulting from hospital mergers
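The coding-system switch in the list above is worth dwelling on, because it can silently zero out a model feature. One defensive pattern is to normalize codes from either system to a single canonical concept before feature extraction, and to fail loudly on anything unmapped. The concepts and tiny mapping table below are illustrative, not a real terminology service.

```python
# Hypothetical sketch: map diagnosis codes from either coding system
# to one canonical concept, so a mid-stream switch from ICD-9 to
# ICD-10 doesn't silently change a model's inputs.
CODE_TO_CONCEPT = {
    "250.00": "type_2_diabetes",  # ICD-9
    "E11.9": "type_2_diabetes",   # ICD-10
    "486": "pneumonia",           # ICD-9
    "J18.9": "pneumonia",         # ICD-10
}

def extract_features(diagnosis_codes):
    """Turn raw codes into the binary features the model expects,
    raising on codes the mapping has never seen rather than
    quietly treating them as absent."""
    features = {"type_2_diabetes": 0, "pneumonia": 0}
    for code in diagnosis_codes:
        concept = CODE_TO_CONCEPT.get(code)
        if concept is None:
            raise KeyError(f"Unmapped diagnosis code: {code}")
        features[concept] = 1
    return features

print(extract_features(["250.00"]))  # {'type_2_diabetes': 1, 'pneumonia': 0}
print(extract_features(["E11.9"]))   # same features after the ICD-10 switch
```

The key design choice is the loud failure: an unmapped code is a signal that upstream data collection changed, which is exactly the event the care team needs to hear about.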
Other potential problems to be cognizant of include changes to your facility’s EHR system. Updates to the system can change how terms are defined, which in turn can affect predictive algorithms that rely on those definitions. If a term like elevated temperature or fever is renamed pyrexia in one of the EHR’s drop-down menus, for example, it may no longer map to an algorithm that uses elevated temperature as one of the variables for predicting sepsis, or any number of other common infections. Similarly, if an ML-based model has been trained on a patient dataset from a medical specialty practice or hospital cohort, that data is likely to generate misleading outputs when the model is applied in a primary care setting.
Finlayson et al. mention another example to be aware of: changes in the way physicians practice can influence data collection: “Adoption of new order sets, or changes in their timing, can heavily affect predictive model output.” Clearly, problems like this necessitate strong interdisciplinary ties, including an ongoing dialogue between the chief medical officer, clinical department heads, and the chief information officer and his or her team. Equally important is the need for clinicians in the trenches to look for subtle changes in practice patterns that can impact the predictive analytics tools currently in place. Many dataset mismatches can be solved by updating variable mapping, retraining or redesigning the algorithm, and conducting a multidisciplinary root cause analysis.
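Practice-pattern changes like new order sets are hard to anticipate, but their effect on model output can be watched for directly. A minimal sketch, assuming a model that emits a risk score between 0 and 1: track the average score over a recent window and alert when it drifts past the baseline established at validation. The window size, baseline, and threshold here are illustrative choices, not recommendations.

```python
# Sketch of a simple output-drift monitor: compare the model's average
# predicted risk over a recent window against a validation baseline.
# Window size, baseline, and threshold are illustrative.
from collections import deque
from statistics import mean

class OutputDriftMonitor:
    def __init__(self, baseline_mean, window=500, threshold=0.10):
        self.baseline = baseline_mean
        self.recent = deque(maxlen=window)  # keeps only the last N scores
        self.threshold = threshold

    def record(self, predicted_risk):
        self.recent.append(predicted_risk)

    def drifted(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent data to judge
        return abs(mean(self.recent) - self.baseline) > self.threshold

monitor = OutputDriftMonitor(baseline_mean=0.12, window=5, threshold=0.10)
for risk in [0.30, 0.28, 0.35, 0.31, 0.29]:  # a new order set shifts outputs
    monitor.record(risk)
print(monitor.drifted())  # True: average risk moved well past baseline
```

A drift alert doesn’t say *why* outputs moved; it is the trigger for exactly the interdisciplinary conversation described above.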
While addressing dataset shift issues will improve the effectiveness of your AI-based algorithms, they are only one of many stumbling blocks to contend with. One classic example demonstrating that computers are still incapable of matching human intelligence is the study that concluded that patients with asthma are less likely to die from pneumonia than those who don’t have asthma. The machine learning tool that reached that unwarranted conclusion had failed to take into account the fact that asthmatics often get faster, earlier, more intensive treatment when their condition flares up, which results in a lower mortality rate. Had clinicians acted on the misleading correlation between asthma and fewer deaths from pneumonia, they might have decided asthma patients don’t necessarily need to be hospitalized when they develop pneumonia.
This kind of misdirection is relatively common and emphasizes the fact that ML-enhanced tools sometimes have trouble separating useless “noise” from meaningful signal. Another example worth noting: Some algorithms designed to help detect COVID-19 by analyzing X-rays suffer from this shortcoming. Several of these deep learning algorithms rely on confounding variables instead of focusing on medical pathology, giving clinicians the impression that they are accurately identifying the infection or ruling out its presence. Unbeknownst to their users, these algorithms have been shown to rely on text markers or patient positioning rather than pathology findings.
At Mayo Clinic, we have had to address similar problems. A palliative care model that was trained on data from the Rochester, Minnesota, community, for instance, did not work well in our health system because the severity of patient disease in a tertiary care facility is very different from what’s seen in a local community hospital. Similarly, one of our algorithms broke when a vendor did a point release of its software and changed the format of the results. We also had a vendor run 10 of our known stroke patients through its CT stroke detection system; it identified only one of them. The root cause: Mayo Clinic medical physicists have optimized radiation dose to 25% of industry standards to reduce patients’ exposure, but that changed the signal-to-noise ratio of the images, and the vendor’s system, which hadn’t been trained on images with that ratio, couldn’t detect the strokes.
Valentina Bellini, with the University of Parma, Parma, Italy, and her colleagues sum up the AI shortcut dilemma in a graphic that illustrates three broad problem areas: poor-quality data; ethical and legal issues; and a lack of educational programs for clinicians who may be skeptical or uninformed about the value and limitations of AI-enhanced algorithms in intensive care settings.
As we have pointed out in other blogs, ML-based algorithms rely on math, not magic. But when reliance on that math overshadows clinicians’ diagnostic experience and common sense, they need to partner with their IT colleagues to find ways to reconcile artificial and human intelligence.