Does Health Care AI Have a Credibility Problem?
Many physicians ignore the recommendations provided by machine learning algorithms because they don’t trust them. A few imaginative software enhancements may give them more confidence in an algorithm’s diagnostic and therapeutic suggestions.
By John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform
While developers continue to produce innovative AI tools with the potential to redefine the practice of medicine, most physicians and nurses do not have the background in data science required to fully grasp what's "under the hood," which is one reason many hesitate to act on the diagnostic and treatment recommendations coming from these algorithms.
This credibility dilemma can be addressed in two ways. At 20,000 feet, we can provide plain-English descriptions of how modeling techniques work, including approaches like convolutional neural networks, random forest analysis, gradient boosting and clustering. Several of our articles and books have done just that. Similarly, there are resources available on the JAMA site, along with several consumer-oriented graphics, to accomplish that task. One JAMA video, for instance, illustrates how a deep-learning neural network works. While these resources give clinicians more overall confidence in the ability of machine learning (ML) to augment diagnostic decision making, they do little to prove that a specific recommendation from a specific algorithm is credible for an individual patient. For that to occur, we need to drill down more deeply.
One solution is to use algorithms that incorporate enhanced software, including saliency maps and generative adversarial networks. To illustrate, let's assume that we are trying to determine whether a patient has COVID-19. The gold standard is reverse transcription–polymerase chain reaction (RT-PCR) testing. In low-resource clinical settings, this may not be feasible, which is why technologists are developing deep-learning tools to help resolve the problem. One option is to use machine-learning-based algorithms to analyze chest X-rays for evidence of the infection. But if an algorithm simply tells a physician that their patient has COVID-19 based on its analysis of the image, without offering a rationale they can understand, it's unlikely the advice will be acted upon. An algorithm with a saliency map built into it, on the other hand, can highlight the clusters of pixels in the X-ray that suggest the infection. Saliency maps have long been used in the image recognition space. Alex DeGrave, of the University of Washington, and his colleagues have used these saliency maps to demonstrate how easy it is to be misled by ML algorithms that are not properly vetted. Their analysis included X-ray images with the saliency maps overlaid on them; they appear as a series of red dots at specific locations on the images (Figure 1 in DeGrave et al), enabling clinicians to visually inspect those areas of the patient's lungs that suggest COVID-19.
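For readers curious about the mechanics, one common way to build a saliency map is occlusion: mask small regions of the image and record how much the model's score drops, since a large drop marks a region the model relies on. The sketch below is a minimal illustration of that idea, not the DeGrave et al pipeline; the `toy_score` function is a hypothetical stand-in for a trained classifier.

```python
import numpy as np

def occlusion_saliency(image, score_fn, patch=4):
    """Estimate pixel importance by sliding an occluding patch over
    the image and measuring how much the model's score drops."""
    base = score_fn(image)
    h, w = image.shape
    saliency = np.zeros_like(image, dtype=float)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i+patch, j:j+patch] = 0.0  # mask this region
            # A big score drop means the masked region mattered.
            saliency[i:i+patch, j:j+patch] = base - score_fn(occluded)
    return saliency

# Hypothetical "model": scores an image by the brightness of its
# top-left quadrant, standing in for a classifier that attends to
# one region of a lung field.
def toy_score(img):
    return img[:8, :8].mean()

img = np.random.rand(16, 16)
sal = occlusion_saliency(img, toy_score)
# The saliency map is highest exactly where the toy model "looks".
```

Overlaying such a map on the original X-ray (for example, as a colored heat layer) produces the kind of red-dot visualization clinicians can inspect directly.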
Similarly, such software enhancements have helped cardiologists identify patients with hypertrophic cardiomyopathy and cardiac amyloidosis by visualizing key sections of echocardiograms. Investigators from Cedars-Sinai Medical Center, Los Angeles, and Stanford University used a deep-learning algorithm to evaluate the thickness of patients' left ventricular wall and their cavity size. Increased wall thickness suggests the presence of these cardiac abnormalities, which in turn allows clinicians to refer affected patients to specialty clinics. But instead of just providing physicians with a recommendation to refer based on a finding of wall thickness, the algorithm also provided videos containing heat maps that pinpointed the suspect areas in the echocardiograms. (Two-dimensional heat maps are a way of visualizing an affected location using color.) Duffy et al state: "Key points along the septal wall and left ventricular posterior wall are identified, with a heatmap showing potential areas for annotation and subsequent measurements from the deep-learning algorithm. Because the model can systematically measure each frame of the video, beat-by-beat assessment of left ventricular dimensions is made possible, enabling higher accuracy and precision." The algorithm classified cardiac amyloidosis with an area under the curve (AUC) of 0.83; the AUC for hypertrophic cardiomyopathy was 0.98.
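For clinicians who want intuition for those AUC figures: the area under the ROC curve equals the probability that the model scores a randomly chosen affected patient higher than a randomly chosen unaffected one, so an AUC of 0.98 means near-perfect ranking. A minimal sketch of that computation, using made-up scores rather than data from the Duffy et al study:

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve, computed directly as the probability
    that a randomly chosen positive case outranks a randomly chosen
    negative case (ties count as half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]   # scores for affected patients
    neg = scores[labels == 0]   # scores for unaffected patients
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical example: 1 = cardiac amyloidosis, 0 = unaffected;
# higher score = model thinks disease is more likely.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc(labels, scores))  # 8 of 9 positive/negative pairs ranked
                            # correctly, so AUC = 8/9 ≈ 0.89
```

An AUC of 0.5 would mean the model ranks patients no better than chance, which is why values of 0.83 and 0.98 represent clinically meaningful discrimination.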
Stanford University researchers have taken a similar approach to help clinicians understand the echocardiography recommendations coming from a deep-learning system. Amirata Ghorbani and associates trained a convolutional neural network (CNN) on over 2.6 million echocardiogram images from more than 2,800 patients and demonstrated it could identify enlarged left atria, left ventricular hypertrophy and several other abnormalities. Ghorbani et al presented readers with "biologically plausible regions of interest" in the echocardiograms they analyzed so they could see for themselves the reason for the interpretation the model had arrived at. For instance, if the CNN identified a structure such as a pacemaker lead, it highlighted the pixels it identified as the lead. Similar clinician-friendly images were presented for a severely dilated left atrium.
Clinicians have spent decades learning the complexities of the diagnostic reasoning process and have memorized countless disease scripts that enable them to deliver expert care. While machine-learning-based algorithms are poised to augment these cognitive skills, developers are more likely to gain practitioners' trust if they go the extra mile and provide the right visualization tools.