Instructive Mistakes and Actionable Insights

Some clinicians reject any AI-infused recommendations, while others are so confident in these tools that they forfeit their independent diagnostic skills. A closer look at the research can turn these mistakes into insights.

By John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform

“AI is Better at Diagnosing Skin Cancer than Your Doctor, Study Finds.”
“Artificial Intelligence Outperforms Real Doctors, Studies Show.”

Headlines like these have fired the imaginations of patients and technology enthusiasts, but have also frustrated many seasoned clinicians who see them as sensationalistic hype. As is often the case, the truth can’t be easily condensed into a catchy headline; nor can such headlines be condemned by the snap judgements of health care professionals who don’t have the time to dispassionately review the evidence.

Myura Nagendran, with Imperial College London, and his colleagues have done just that. They analyzed over 7,000 study records (including 140 articles) and nearly 1,000 trial registrations to answer the question: How do deep learning algorithms compare to clinicians for medical imaging? As Nagendran et al point out, deep learning tools like convolutional neural networks (CNNs) can analyze raw data from thousands of medical images and isolate clusters of pixels that suggest pathology. Because computers can detect subtle differences among the millions of pixels that compose each image, they have the potential to see patterns that humans cannot. These tools have been used to augment a clinician’s skills in screening for diabetic retinopathy, identifying congenital cataracts, detecting small colon polyps, and making prognostic predictions on mortality.

Nagendran et al’s search of published studies and ongoing trials located only 10 deep learning randomized clinical trials, as well as 81 non-randomized clinical trials. Only 9 of the latter were prospective, and just 6 were tested in a real-world clinical setting. Despite these limitations, 61 of the 81 studies “stated in their abstract that performance of artificial intelligence was at least comparable to (or better than) that of clinicians.” Given these claims, it’s not surprising to find overoptimistic headlines in the popular press.

While these results indicate that there is, in fact, a lot of hype surrounding deep learning, it’s also important to point out that since their analysis was conducted, several prospective studies have been published that support the real-world use of AI algorithms in clinical practice. Todd Hollon and colleagues performed intraoperative diagnosis of brain cancer using stimulated Raman histology, an advanced optical imaging method, along with a CNN to help interpret surgical specimens. The machine learning tools were trained to detect 13 histologic categories and included an inference algorithm to help make a diagnosis of brain cancer. Hollon et al conducted a 2-arm, prospective, multicenter, non-inferiority trial to compare the CNN results to those of human pathologists. The trial, which evaluated 278 specimens, demonstrated that the machine learning system was as accurate as pathologists’ interpretation (94.6% vs 93.9%). Equally important, it took under 15 seconds for surgeons to get their results with the AI system, compared to 20-30 minutes with conventional techniques.

There have also been prospective studies that have shown the value of deep learning in areas other than medical imaging. Mayo Clinic investigators demonstrated the real-world value of deep learning in the EAGLE trial, a pragmatic randomized clinical trial. An AI algorithm applied to ECGs enabled primary care physicians to improve the diagnosis of asymptomatic left ventricular systolic dysfunction by detecting low ejection fraction (EF). In absolute terms, for every 1,000 patients screened, the AI system generated five more new diagnoses of low EF than usual care.
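For readers who prefer to see the arithmetic, the "per 1,000 screened" figure can be restated as a number needed to screen. The short Python sketch below is only an illustration of that conversion; the function name is hypothetical, and the resulting figure is derived from the numbers quoted above rather than reported by the trial investigators.

```python
def number_needed_to_screen(extra_diagnoses, patients_screened):
    """Patients who must be screened to yield one additional diagnosis.

    This is simply the reciprocal of the absolute gain in diagnoses
    per patient screened.
    """
    if extra_diagnoses <= 0:
        raise ValueError("extra_diagnoses must be positive")
    return patients_screened / extra_diagnoses


# Figures quoted in the text: five additional diagnoses per 1,000 screened.
print(number_needed_to_screen(5, 1000))  # 200.0
```

Under that reading, roughly 200 patients would need AI-assisted ECG screening to find one additional case of low EF.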

Similarly, the data set used in the EAGLE trial was also used to demonstrate that an ECG-based AI algorithm can predict long-term mortality after cardiac surgery. Abdulah Mahayni and associates found that “A novel electrocardiography-based AI algorithm that predicts severe ventricular dysfunction can predict long-term mortality among patients with LVEF above 35% undergoing valve and/or coronary bypass surgery.”

A prospective study conducted by Nathan Brajer and his colleagues at Duke University School of Medicine found that a machine learning model could predict adult in-hospital mortality at the time of admission. They collected EHR data from about 31,000 patients who were admitted to an academic medical center and were able to predict mortality with an area under the receiver operating characteristic curve (AUROC) between 0.84 and 0.89.
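Because AUROC figures appear throughout this literature, it may help to see what the number measures. One common interpretation is the probability that a randomly chosen patient who had the outcome received a higher risk score than a randomly chosen patient who did not. The Python sketch below computes AUROC that way on a small, entirely hypothetical set of labels and scores; it does not use data from any of the studies discussed here, and the function name is our own.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died in hospital, 0 = survived.
died = [0, 0, 1, 0, 1, 0, 1, 0]
risk = [0.05, 0.10, 0.80, 0.30, 0.25, 0.20, 0.90, 0.15]
print(round(auroc(died, risk), 3))
```

A value of 0.5 means the model ranks patients no better than chance, and 1.0 means every patient with the outcome was ranked above every patient without it.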

Finally, Jeonghyuk Park and associates from South Korea prospectively evaluated a deep learning algorithm designed to help diagnose gastric tumors located during endoscopic biopsies. Their analysis generated an impressive AUROC of 0.9790 for two-tier classification of whole slide images, i.e., negative versus positive detection of stomach cancer.

In the conclusion of their study, Dr Park and colleagues remind readers that their digital tool has “benefits as an assistance tool,” a point well taken by any clinician who has doubts about using deep learning algorithms in clinical medicine.

Looking at the preponderance of the evidence, one lesson stands out: A well-documented deep learning algorithm is unlikely to replace clinicians but is likely to enhance human decision-making. With that in mind, we encourage the next generation of clinicians to be familiar enough with these technologies to decide when and how they should be used.
