We Need to Open Up the AI Black Box
To convince physicians and nurses that deep learning algorithms are worth using in everyday practice, developers need to explain how they work in plain clinical English.
Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.
AI’s so-called black box refers to the fact that much of the technology underlying machine learning-enhanced algorithms rests on probability and statistics, with no human-readable explanation of how a result was reached. Often that’s because the advanced math or data science behind the algorithms is too complex for the average user to understand without additional training. Several stakeholders in digital health maintain, however, that this lack of understanding isn’t that important. They argue that as long as an algorithm generates actionable insights, most clinicians don’t really care about what’s “under the hood.” Is that reasoning sound?
Some thought leaders point out that many advanced, computer-enhanced diagnostic and therapeutic tools currently in use are not fully understood by the physicians who nonetheless accept them. The CHA2DS2-VASc score, for instance, is used to estimate the likelihood of a patient with non-valvular atrial fibrillation having a stroke. Few clinicians are familiar with the original research or detailed reasoning upon which the calculator is based, but they use the tool anyway. Similarly, many physicians use the FRAX score to estimate a patient’s 10-year risk of a bone fracture, even though they have not investigated the underlying math.
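To make the contrast concrete, here is a minimal sketch of how the stroke score is tallied, using the commonly published point values for CHA2DS2-VASc. The `Patient` class and its field names are hypothetical and exist only for illustration; this is not a clinical tool, and it says nothing about the research that produced the weights.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout, used only for this illustration.
    age: int
    female: bool
    heart_failure: bool = False
    hypertension: bool = False
    diabetes: bool = False
    prior_stroke_tia: bool = False   # prior stroke, TIA, or thromboembolism
    vascular_disease: bool = False


def cha2ds2_vasc(p: Patient) -> int:
    """Sum the commonly published CHA2DS2-VASc point values."""
    score = 0
    score += 1 if p.heart_failure else 0      # C: congestive heart failure
    score += 1 if p.hypertension else 0       # H: hypertension
    if p.age >= 75:                           # A2: age 75 or older
        score += 2
    elif p.age >= 65:                         # A: age 65 to 74
        score += 1
    score += 1 if p.diabetes else 0           # D: diabetes mellitus
    score += 2 if p.prior_stroke_tia else 0   # S2: prior stroke/TIA
    score += 1 if p.vascular_disease else 0   # VA: vascular disease
    score += 1 if p.female else 0             # Sc: sex category (female)
    return score


print(cha2ds2_vasc(Patient(age=78, female=True, hypertension=True)))  # 4
```

The arithmetic is transparent in a way that a deep learning model's internals are not, even if few users have read the studies behind the point values.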
It’s important to point out, however, that the stroke risk tool and the FRAX tool both have major endorsements from organizations that physicians respect. The American Heart Association and the American College of Cardiology both recommend the CHA2DS2-VASc score, while the National Osteoporosis Foundation supports the use of the FRAX score. That gives physicians confidence in these tools even if they don’t grasp the underlying details. To date, no major professional association recommends specific AI-enabled algorithms to supplement the diagnosis or treatment of disease. The American Diabetes Association did include a passing mention of an AI-based screening tool in its 2020 Standards of Medical Care in Diabetes, stating: “Artificial intelligence systems that detect more than mild diabetic retinopathy and diabetic macular edema authorized for use by the FDA represent an alternative to traditional screening approaches. However, the benefits and optimal utilization of this type of screening have yet to be fully determined.” That can hardly be considered a recommendation.
Given this scenario, most physicians have reason to be skeptical, and surveys bear out that skepticism. A survey of 91 primary care physicians found that understandability is one of the important attributes they want from AI before trusting its recommendations during breast cancer screening. Similarly, a survey of senior specialists in the UK found that understandability was one of their primary concerns about AI. Among New Zealand physicians, 88% were more likely to trust an AI algorithm that produced an understandable explanation of its decisions.
Of course, it may not be possible to fully explain the advanced mathematics used to create machine learning-based algorithms. But there are other ways to describe the logic behind these tools that would satisfy clinicians. As we have mentioned in previous publications and oral presentations, tutorials are available that simplify machine learning-related techniques such as neural networks, random forest modeling, clustering, and gradient boosting. Our most recent book contains an entire chapter on this digital toolbox. Similarly, JAMA has created clinician-friendly video tutorials that graphically illustrate how deep learning is used in medical image analysis and how such algorithms can help detect lymph node metastases in breast cancer patients.
These resources require clinicians to take the initiative and learn a few basic AI concepts, but developers and vendors also have an obligation to make their products more transparent. One way to accomplish that goal is through saliency maps and generative adversarial networks. Using such techniques, it’s possible to highlight the specific pixel grouping that a neural network has identified as a trouble spot, which the clinician can then view on a radiograph, for example. Alex DeGrave of the University of Washington and his colleagues used this approach to help explain why an algorithm designed to detect COVID-19-related changes in chest X-rays made its recommendations. Amirata Ghorbani and associates from Stanford University have taken a similar approach to help clinicians comprehend the echocardiography recommendations coming from a deep learning system. The researchers trained a convolutional neural network (CNN) on over 2.6 million echocardiogram images from more than 2,800 patients and demonstrated that it was capable of identifying enlarged left atria, left ventricular hypertrophy, and several other abnormalities. To open up the black box, Ghorbani et al presented readers with “biologically plausible regions of interest” in the echocardiograms they analyzed, so readers could see for themselves the reason for the interpretation the model had arrived at. For instance, if the CNN said it had identified a structure such as a pacemaker lead, it highlighted the pixels it had identified as the lead. Similar clinician-friendly images are presented for a severely dilated left atrium and for left ventricular hypertrophy.
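The idea behind a saliency map can be sketched in a few lines. The example below uses occlusion sensitivity, one simple saliency technique: blank out each pixel in turn and record how much the model's score drops. It is not the method used by DeGrave or Ghorbani and their colleagues, and `toy_model` is a hypothetical stand-in for a trained network, invented here so the example runs on its own.

```python
def toy_model(image):
    """Hypothetical 'network': scores a 4x4 image by the total
    brightness of its central 2x2 region."""
    return sum(image[r][c] for r in (1, 2) for c in (1, 2))


def occlusion_saliency(image, model, fill=0.0):
    """Return a map of how much the score drops when each pixel is blanked."""
    base = model(image)
    height, width = len(image), len(image[0])
    saliency = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            occluded = [row[:] for row in image]  # copy so the input is untouched
            occluded[r][c] = fill                 # hide one pixel
            saliency[r][c] = base - model(occluded)
    return saliency


def hottest_pixel(saliency):
    """Return (row, col) of the pixel whose removal changes the score most."""
    coords = ((r, c) for r in range(len(saliency)) for c in range(len(saliency[0])))
    return max(coords, key=lambda rc: saliency[rc[0]][rc[1]])


# A dim image with one bright spot inside the region the toy model looks at.
image = [[0.1] * 4 for _ in range(4)]
image[1][2] = 0.9

saliency = occlusion_saliency(image, toy_model)
print(hottest_pixel(saliency))
```

Overlaying such a map on the original image is what lets a clinician see which region drove the output. Pixels the model ignores get a saliency of zero, while the bright spot stands out.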
Deep learning systems are slowly ushering in a new way to manage diagnosis and treatment, but to bring skeptical clinicians on board, we need to pull the curtain back. In addition to evidence that these tools are equitable and clinically effective, practitioners want reasonable explanations that demonstrate the tools will do what they claim to do.