Should AI-Driven Algorithms Serve as Diagnostic Assistants?
Several of these digital tools are supported by strong evidence and are worth considering, not to replace your clinical judgement, but to augment it.
By Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., Diercks President, Mayo Clinic Platform.
There’s wisdom to be found in many tired clichés. “Don’t throw out the baby with the bathwater” comes to mind. Many clinicians look at the mistakes and fabrications produced by AI algorithms and decide it’s best to ignore these digital tools altogether, concluding that they offer no help in arriving at a diagnosis. Is that point of view justified in light of the published evidence?
Investigators with Cohen Children’s Medical Center in New York recently fed 100 pediatric case reports from JAMA Pediatrics and the New England Journal of Medicine into ChatGPT-3.5 to determine how accurate it was in generating a differential diagnosis list and a final diagnosis. When compared to ground truth, namely physicians’ definitive diagnoses, the chatbot had an error rate of 83%. Their final report concluded: “Most of the incorrect diagnoses generated by the chatbot (47 of 83 [56.7%]) belonged to the same organ system as the correct diagnosis (eg, psoriasis and seborrheic dermatitis) but were not specific enough to be considered correct (eg, hypoparathyroidism and hungry bone syndrome).”
Similarly, an analysis of 150 medical cases from Medscape found that ChatGPT wasn’t an accurate diagnostic tool. While it answered about half of the cases correctly, its area under the curve (AUC) statistic was only 66%. The report concluded: “Based on our qualitative analysis, ChatGPT struggles with the interpretation of laboratory values, imaging results, and may overlook key information relevant to the diagnosis.”
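For readers who want a concrete sense of what that figure means: an AUC of 0.5 indicates a model whose scores rank cases no better than chance, while 1.0 indicates perfect discrimination, so 66% sits well below what is usually expected of a clinical decision-support tool. A minimal sketch of how the statistic is computed, using scikit-learn with invented labels and scores rather than anything from the Medscape analysis:

```python
# Illustrative only: invented labels and scores, not data from the Medscape study.
from sklearn.metrics import roc_auc_score

# 1 = patient actually has the disease, 0 = does not
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Model's predicted probability that each patient has the disease
y_score = [0.8, 0.4, 0.6, 0.3, 0.5, 0.2, 0.9, 0.7, 0.4, 0.1]

# AUC measures how well the scores rank true cases above non-cases
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```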
A recent study in NEJM AI also evaluated whether several large language models could be used to automate clinical coding. The authors looked at the performance of GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b and found: “All tested LLMs performed poorly on medical code querying, often generating codes conveying imprecise or fabricated information.”
Based on this “dirty bathwater,” how should clinicians view AI-enabled algorithms designed to improve diagnostic and therapeutic decision making? The research cited above tells only half of the story. Some of the strongest evidence in favor of using AI algorithms comes from the American Gastroenterological Association, which issued a clinical practice update on the role of AI in colon polyp diagnosis and management: “…a myriad of studies has reported the successful application of AI for the recognition of colon polyps using CADe [computer-aided detection]. These algorithms are the equivalent of a highly trained set of eyes relentlessly scanning the monitor alongside the endoscopist, while simultaneously “flagging” lesions that potentially represent precancerous polyps.”
There’s also strong evidence to support the use of AI algorithms to assist in the detection of melanoma. Several early studies suggested their value, but because they were retrospective in design, thought leaders rightly questioned whether they should be used in clinical practice. A recent prospective multicenter study, however, confirmed the results of these weaker analyses. German investigators used an open-source ensemble model for detecting melanoma and compared its diagnostic accuracy to that of dermatologists, relying on a test data set from eight separate hospitals that included four different camera setups and rare skin cancer subtypes. They found that the algorithm outperformed the physicians, with greater sensitivity (94% vs 73%). In other words, the model correctly identified a larger share of the patients who truly had melanoma, the true positive rate.
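Sensitivity is simply the fraction of patients with the disease whom the model correctly flags. A minimal worked sketch with invented counts, not the German study’s data:

```python
# Illustrative counts only; not taken from the German multicenter study.
true_positives = 94   # melanoma cases the model correctly flagged
false_negatives = 6   # melanoma cases the model missed

# Sensitivity (true positive rate) = TP / (TP + FN)
sensitivity = true_positives / (true_positives + false_negatives)
print(f"Sensitivity = {sensitivity:.0%}")  # 94%
```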
An international team of researchers recently examined the impact of two deep learning algorithms on diabetes self-management and referral for diabetic retinopathy screening in low-resource communities. In one real-world prospective experiment, they compared an algorithm that provided clinicians with individualized recommendations on diabetes care against the performance of unassisted primary care providers. After two weeks the algorithm’s impact was already apparent: patients were eating fewer refined grains and exercising more. By week four, they were consuming more fresh fruit and fewer starchy vegetables, monitoring their blood glucose more often, and taking their antidiabetic medications more consistently.
To shed light on the discrepancies between positive and negative studies, Adam Rodman, MD, MPH, and his colleagues at Beth Israel Deaconess Medical Center in Boston conducted a unique study in which 50 physicians were asked either to use a large language model (ChatGPT Plus, GPT-4) in addition to their usual diagnostic resources or to rely on their usual diagnostic resources alone. They were then judged on standard criteria for diagnostic accuracy. The researchers found no significant difference between the two groups (76% vs 74%, LLM vs conventional resources). But they added a third arm to the experiment, in which they submitted the case vignettes to the LLM alone. It turned out the LLM scored 16 percentage points higher than the physicians in the conventional-resources group. That puzzled the investigators: why would the algorithm, on its own, yield more accurate diagnoses than either the doctors using the LLM or the doctors who were told not to use it?
When Dr Rodman took a closer look at the conversations between the participants and ChatGPT, he discovered that many of them were not convinced that the chatbot’s recommendations were worth considering when it disagreed with their diagnostic reasoning. “They didn’t listen to AI when AI told them something they didn’t agree with.”
Jonathan Chen, MD, another author of the study, also pointed out that many of the doctors didn’t know how to fully utilize the chatbot’s capabilities: “They were treating it like a search engine for directed questions: ‘Is cirrhosis a risk factor for cancer? What are possible diagnoses for eye pain?’ … It was only a fraction of the doctors who realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question… Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing.”
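The difference Dr. Chen describes, between firing off directed questions and handing the model the whole case, is easy to see in practice. A minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and the `case_history` placeholder are illustrative and not drawn from the study:

```python
# Illustrative sketch: contrast a directed question with a whole-case prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case_history = "62-year-old with fatigue, jaundice, 10-lb weight loss, ..."  # full vignette here

# Directed, search-engine-style question (the way many physicians in the study used the chatbot)
narrow = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is cirrhosis a risk factor for cancer?"}],
)

# Whole-case prompt: paste the entire history and ask for a comprehensive answer
broad = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Here is the full case history:\n{case_history}\n\n"
                   "Give a ranked differential diagnosis with your reasoning "
                   "and the findings that support or argue against each item.",
    }],
)

print(broad.choices[0].message.content)
```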
It's clear that 2025 will be a year in which generative AI tools will need demonstrable measures of quality in order to be trustworthy. In the meantime, there will generally be a human nearby to accept the good recommendations and discard the bad ones.