Reexamining the Evidence Behind Evidence-Based Medicine, Part 2

In Part 1, we discussed the shortcomings of evidence-based medicine and the disconnect between RCTs and bedside clinical care. Part 2 explores possible solutions, including machine learning-based algorithms.

By Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., Diercks President, Mayo Clinic Platform

To understand the potential contribution of machine learning in evidence-based medicine, it helps to take a closer look at what evidence grading systems like GRADE attempt to accomplish. Typically, they summarize the best evidence on a clinical question, relying heavily on systematic reviews. However, evaluating a body of evidence poses a unique set of challenges. Analysts must examine each study for design limitations, sample size, and various types of bias. And when comparing studies to develop a meta-analysis and systematic review, it is necessary to look for consistency across trials, overlap of effects between studies, and much more.1 Among the possible tasks that might be taken on by AI algorithms are “automation of article classification, screening for primary studies, data extraction, and quality assessment.”1

To explore the possible value of AI in this context, Suster et al. used a subset of data from the Cochrane Database of Systematic Reviews (CDSR), focusing on several data points in specific bodies of evidence, including the population studied, the intervention, and outcomes.1 These were then used to assign a quality grade of high, moderate, low, or very low using standard GRADE criteria. After this baseline grading, the researchers used their data set to create an artificial neural network to determine whether it could automate some or all of the quality assessments normally performed by human analysts. Their machine learning assessment tool, called EvidenceGRADEr, performed well (F1 scores above 0.7) when evaluating imprecision and risk of bias (RoB); RoB was based on study design, randomization, blinding, allocation concealment, and related factors. Their algorithm fared poorly in automating other GRADE assessment criteria, however, including inconsistency among studies and publication bias.
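
At its core, this is a supervised text classification task: map a description of a body of evidence to a GRADE quality label and score the result with F1. The following minimal sketch illustrates that idea with a simple TF-IDF pipeline rather than the neural network the authors built; the toy records and labels are invented placeholders, not CDSR data.

```python
# Minimal sketch of GRADE-style quality classification (not the EvidenceGRADEr code).
# Toy records stand in for the population/intervention/outcome text of a body of evidence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

records = [
    ("adults with type 2 diabetes; intensive lifestyle program; cardiovascular events", "moderate"),
    ("children with asthma; inhaled corticosteroids; hospital admissions", "high"),
    ("older adults; vitamin supplementation; all-cause mortality", "low"),
    ("mixed populations; herbal remedy; self-reported pain", "very low"),
]
texts, labels = zip(*records)

# TF-IDF features feeding a linear classifier stand in for the study's neural network.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

predicted = model.predict(texts)
print("macro F1:", f1_score(labels, predicted, average="macro"))
```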

At a more basic level, assessing quality of evidence must start with an evaluation of individual RCTs. Just because a large RCT finds no overall benefit for an intervention does not mean the intervention has no value in specific subgroups; a GRADE rating of “very low” would be of little use in such a situation. Baum et al. illustrated this dilemma and used random forest modeling (RFM) to show how it can be resolved. They analyzed a large RCT, the Look AHEAD trial, which included about 5,000 overweight and obese patients with Type 2 diabetes. Half of the cohort received intensive weight loss and exercise programs while the control group received only supportive education.2 The goal of the Look AHEAD trial was to determine whether the diet and exercise regimen would reduce deaths from cardiovascular complications, non-fatal myocardial infarction, non-fatal stroke, or hospitalization for angina. The study was terminated early because there were no significant differences between the intervention and control groups. The lower caloric intake and increased exercise in the intensive lifestyle group did have a positive impact, helping patients lose weight, but the regimen did not reduce the rate of cardiovascular events.3

Baum et al. conducted an in-depth subgroup analysis using RFM (graphically illustrated here). They constructed a forest of 1,000 decision trees and reviewed 84 covariates that may have influenced patients' response, or lack of response, to the intensive lifestyle modification program. Variables included a family history of diabetes, muscle cramps in the legs and feet, a history of emphysema, kidney disease, amputation, dry skin, loud snoring, marital status, social functioning, hemoglobin A1c, self-reported health, and numerous other characteristics that researchers rarely, if ever, consider when doing a subgroup analysis. The random forest analysis also enabled the investigators to examine how numerous variables interact in multiple combinations to influence clinical outcomes. The original Look AHEAD subgroup analyses, by contrast, examined only three possible variables, and only one at a time.
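
To make the general idea concrete, the sketch below shows one common way to estimate heterogeneous treatment effects with random forests (a simple “T-learner”): fit one forest on treated patients and one on controls, then compare each patient's predicted event risk under the two arms to flag who appears to benefit. This is a hedged illustration under stated assumptions, not the authors' exact model; the synthetic covariates (hba1c, self-rated health, BMI) and outcomes are invented.

```python
# Illustrative T-learner sketch for heterogeneous treatment effects (not the Baum et al. code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(7.0, 1.0, n),   # HbA1c (%)
    rng.integers(1, 6, n),     # self-rated health, 1 (poor) to 5 (excellent)
    rng.normal(35, 5, n),      # body mass index
])
treated = rng.integers(0, 2, n).astype(bool)          # intensive lifestyle arm vs. control
event = (rng.random(n) < 0.15).astype(int)            # placeholder cardiovascular outcomes

# One 1,000-tree forest per arm, mirroring the forest size described above.
rf_treated = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X[treated], event[treated])
rf_control = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X[~treated], event[~treated])

# Estimated individual treatment effect: predicted risk if treated minus risk if untreated.
ite = rf_treated.predict_proba(X)[:, 1] - rf_control.predict_proba(X)[:, 1]
benefit = ite < 0
print(f"estimated to benefit from intensive lifestyle modification: {benefit.mean():.0%}")
```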

Baum et al. discovered that intensive lifestyle modification averted cardiovascular events in two subgroups: patients with poorly controlled diabetes (HbA1c of 6.8% or higher) and patients with well-controlled diabetes (HbA1c below 6.8%) and good self-reported health. Together, these subgroups made up 85% of the patient population studied. The remaining 15%, who had controlled diabetes but poor self-reported general health, responded negatively to the lifestyle modification regimen. In the original Look AHEAD statistical analysis, the negative and positive responders cancelled each other out, leading to the false conclusion that lifestyle modification was useless.4

There is also mounting evidence to suggest that large language models (LLMs) may help improve the assessment of evidence quality by automating several aspects of the process. Li et al. compared seven LLMs to evaluate their ability to retrieve evidence, summarize RCTs, and simplify medical text.5 Additionally, they used prompt engineering, including chain-of-thought reasoning and knowledge-guided prompting, to improve the LLMs' ability to accurately collect data. They found that even in zero-shot settings, the algorithms demonstrated strong summarization skills and effective collection of knowledge. However, their comparative analysis also revealed significant weaknesses: the LLMs' performance fell far short of that of PubMedBERT, for instance, and their output contained factual inconsistencies and domain inaccuracies.
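
For readers unfamiliar with these prompting strategies, the snippet below contrasts a plain zero-shot prompt with a chain-of-thought variant for pulling PICO elements out of an RCT abstract and summarizing the finding. These templates are illustrative only, not those used by Li et al., and how the prompt is sent depends on whichever LLM API is in use.

```python
# Illustrative prompt templates for zero-shot vs. chain-of-thought extraction (hypothetical).
ABSTRACT = "..."  # RCT abstract text goes here

zero_shot_prompt = (
    "Extract the population, intervention, comparator, and outcomes (PICO) "
    f"from this randomized trial abstract, then summarize the main finding:\n\n{ABSTRACT}"
)

chain_of_thought_prompt = (
    "You are assisting with a systematic review.\n"
    "Step 1: Identify the study population and sample size.\n"
    "Step 2: Identify the intervention and comparator.\n"
    "Step 3: Identify the primary and secondary outcomes.\n"
    "Step 4: Reason step by step about what the results show, then write a "
    "two-sentence plain-language summary.\n\n"
    f"Abstract:\n{ABSTRACT}"
)
```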

On the other hand, in a separate study, Lai et al. reported more positive findings when they used two LLMs to assess the risk of bias (RoB) in randomized clinical trials.6 Both LLMs achieved substantial accuracy and consistency. The researchers developed their RoB prompt for the two LLMs (ChatGPT and Claude) based on a modified version of the RoB tool used by the Cochrane Database and compared the LLMs' results to assessments performed by three experts in their respective domains. The Cochrane tool evaluates random sequence generation; allocation concealment; blinding of patients, healthcare clinicians, data collectors, outcome assessors, and data analysts; missing outcome data; and selective outcome reporting, among other issues. The RCTs covered three clinical issues: 1) the relationship between red meat intake and cardiometabolic and cancer outcomes, 2) the safety and efficacy of type 2 diabetes treatments, and 3) drug treatment for insomnia. The LLMs' bias assessments matched those of the human reviewers 84.5% and 89.5% of the time, respectively. Lai et al. concluded: “The LLMs used in this study produced assessments that were very close to those of experienced human reviewers. Automated tools in systematic reviews exist but are underused due to difficult operation, poor user experience, and unreliable results. In contrast, both LLMs had high accessibility and user friendliness, demonstrating outstanding reliability and efficiency, thereby showing substantial potential for facilitating systematic review production.”
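
Scoring this kind of comparison is conceptually simple: line up the LLM's domain-level judgments against the human reviewers' and report the proportion that match. The sketch below shows that calculation; it is not the Lai et al. pipeline, and the trial labels are invented.

```python
# Minimal sketch of scoring LLM risk-of-bias judgments against human reviewers (toy labels).
COCHRANE_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "missing outcome data",
    "selective outcome reporting",
]

human = {"trial_001": ["low", "low", "high", "low", "unclear"]}
llm = {"trial_001": ["low", "low", "high", "unclear", "unclear"]}

matches = total = 0
for trial, human_labels in human.items():
    for h, m in zip(human_labels, llm[trial]):
        matches += (h == m)
        total += 1
print(f"domain-level agreement: {matches / total:.1%}")
```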

Although LLMs and related AI-based algorithms show promise, one weakness that needs attention is their lack of transparency. Deep neural networks in particular remain “black boxes” to most clinicians, as well as to many technologists, making it difficult for users to trust their conclusions. The mathematical equations and advanced statistics they employ are unfamiliar to anyone who has not been trained in data science. In addition, all LLMs base their conclusions on content from the data set they draw from, using pattern recognition to string together long series of words and phrases that generate coherent answers to the prompts they receive. But they do not reason in the human sense. Since LLMs use deep neural models to arrive at their conclusions, Horvitz and Mitchell point out that these models “do not have the ability to quickly learn and adapt as humans to real-time experiences and information. Once they are trained, these models are then applied but typically remain fixed or sometimes they are updated via the traditional long cycle times of fine tuning.”7 As ChatGPT explains it, LLMs “sometimes mimic reasoning by producing steps that look like human problem-solving. This is typically a pattern-completion process rather than the model having an abstract, conceptual understanding of math in the human sense.” This overreliance on pattern recognition, combined with training on data sets of uneven reliability, sometimes results in fabricated answers, so-called hallucinations.

Another weakness of LLMs concerns the way in which many have been evaluated. Bedi et al. examined 519 studies that evaluated LLMs in healthcare and found only 5% actually used real patient data. “Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention.”8

Patient-Centric Care

A reexamination of EBM should also include a discussion of patient values and preferences. The initial description of EBM stressed the importance of evidence from clinical research, but subsequent definitions of the term included the patient’s values and preferences. That shift reflects the emphasis on shared decision making that has emerged over the years, but it is also based on sound empirical reasoning and clinical research. Most clinicians would agree that treatment decisions cannot rely solely on statistics, data analysis, and evidence grading; they also require the application of community context, compassion, empathy, and clinical experience.9 Several clinical studies support this perspective. For example, Watts et al. evaluated numerous systematic reviews and meta-analyses and found that clinical outcomes were correlated with the degree to which patients perceived that their providers displayed empathy and compassion.10

The debate on how to evaluate clinical research and how to apply it at the bedside will continue for some time, but there’s evidence to suggest that AI algorithms can play a valuable part in resolving the controversy.

References

1. Suster S, et al. Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study. J Med Internet Res. 2023;25:e35568.

2. Baum A, Scarpa J, Bruzelius E, Tamler R, Basu S, Faghmous J. Targeting weight loss interventions to reduce cardiovascular complications of type 2 diabetes: a machine learning-based post-hoc analysis of heterogeneous treatment effects in the Look AHEAD trial. Lancet Diabetes Endocrinol. 2017;5:808–815.

3. The Look AHEAD Research Group. Cardiovascular Effects of Intensive Lifestyle Intervention in Type 2 Diabetes. N Engl J Med. 2013;369:145–154.

4. Cerrato P, Halamka J. Reinventing Clinical Decision Support. Taylor & Francis, Boca Raton, Fl, 2020.

5. Li J et al. Benchmarking Large Language Models in Evidence-Based Medicine. IEEE J Biomed Health Inform. 2024. PMID: 39437276. https://pubmed.ncbi.nlm.nih.gov/39437276/ Accessed January 6, 2025

6. Lai H et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Network Open. 2024;7(5):e2412687. doi:10.1001/jamanetworkopen.2024.12687

7. Horvitz E, Mitchell T. Scientific progress in artificial intelligence: history, status, and futures. In: Jamieson KH, Mazza AM, Kearney W, eds. Realizing the Promise and Minimizing the Perils of AI for Science and the Scientific Community. University of Pennsylvania Press; 2024:167.

8. Bedi S, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025;333:319–328.

9. Fernandez A, et al. Evidence-based medicine: Is it a bridge too far? Health Res Policy Syst. 2015;13:66.

10. Watts E. The role of compassionate care in medicine: Toward improving patients’ quality of care and satisfaction. J Surg Res. 2023; 289:1-7.
