Can AI Help Review the Medical Literature?

Healthcare professionals rely heavily on their clinical experience, supplemented by what they read in the medical literature. Several developers are trying to bring AI algorithms into the mix.

By John Halamka, M.D., Diercks President, Mayo Clinic Platform, and Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform

In a recent column, we wrote about how to interpret the medical literature and discussed the various types of reports you are likely to see. We left out one important type of report: the literature review, including narrative and systematic reviews and meta-analyses. And since the theme of our weekly Mayo Clinic Platform column is the digital health frontier, it makes sense to ask whether AI can assist in creating such reviews.

Literature reviews are invaluable for busy clinicians because they sum up the most important research studies in a relatively short article, saving them the long hours needed to sift through the literature themselves. They usually fall into two categories. Narrative reviews may discuss a wide variety of topics but “have no explicit criteria for selecting the included studies, do not include systematic assessments of the risk of bias associated with primary studies, and do not provide quantitative best estimates or rate the confidence in these estimates.” Systematic reviews, as the name implies, take these issues into account in the hope of generating a more trustworthy source of information for physicians and nurses to use at the bedside.

These latter reviews may also include a meta-analysis, which combines the results of many studies to increase the total number of patients being analyzed. The goal is to achieve “more precise estimates of rates or risks… and less likelihood of a Type II error. Combining data from several smaller clinical trials on the same subject may reveal a clinically important difference in treatment that the smaller trials lacked the power to detect.” (A Type II statistical error occurs when a trial incorrectly concludes that a diagnostic approach or treatment option has no value because the patient population being studied was too small to detect that value.) Some critics question whether it is reasonable to combine the results of many separate trials that enrolled patients with very different characteristics; even so, meta-analyses have become popular in the medical press.
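For readers who like to see the arithmetic, here is a minimal sketch, in Python, of how a fixed-effect (inverse-variance) meta-analysis pools several small trials into one more precise estimate. The trial results below are entirely made up for illustration; each trial on its own is inconclusive, but the pooled estimate is not.

```python
import math

# Hypothetical results from three small trials: (log risk ratio, standard error).
# Each trial's own 95% confidence interval crosses 0, i.e., "no effect detected."
trials = [
    (-0.35, 0.25),
    (-0.30, 0.30),
    (-0.40, 0.28),
]

# Fixed-effect inverse-variance pooling: weight each trial by 1 / SE^2,
# so the most precise trials contribute the most to the pooled estimate.
weights = [1 / se**2 for _, se in trials]
pooled = sum(w * est for (est, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled log risk ratio: {pooled:.3f} (95% CI {lo:.3f} to {hi:.3f})")
# With these numbers the pooled interval is narrower than any single trial's
# and excludes 0, avoiding the Type II error the small trials risked.
```

Running the sketch shows the pooled interval excluding zero even though every individual trial's interval crosses it, which is exactly the added power the quoted passage describes.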

Several developers have created AI tools to help generate such reviews. One approach is to rely on a large language model built on a carefully selected data set, one that is focused on healthcare and generally more trustworthy. A problem with popular LLMs like ChatGPT and Claude is that they draw on much of the open internet, misinformation included. LLM-based tools like Consensus and OpenEvidence, on the other hand, use more reliable information sources, which increases the likelihood that a search will surface research studies worth including in one’s literature review.
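To illustrate the general idea (not how Consensus or OpenEvidence actually work internally, which the vendors do not fully disclose), here is a minimal sketch in which a question is matched only against a small, vetted corpus before any answer is generated. The abstracts and query are hypothetical stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical curated corpus: abstracts vetted by editors, not the open web.
curated_abstracts = [
    "Randomized trial of drug A versus placebo in adults with hypertension.",
    "Meta-analysis of statin therapy for primary prevention of stroke.",
    "Cohort study of exercise and glycemic control in type 2 diabetes.",
]

query = "Does drug A lower blood pressure?"

# Rank only the vetted documents against the query; whatever is retrieved
# here could then serve as the sole grounding context handed to an LLM.
vectorizer = TfidfVectorizer().fit(curated_abstracts + [query])
doc_vecs = vectorizer.transform(curated_abstracts)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_vecs)[0]
for score, abstract in sorted(zip(scores, curated_abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

The design point is simply that restricting retrieval to a trusted corpus constrains what the model can cite, which is the property these healthcare-focused tools are selling.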

Unfortunately, this solves only part of the problem. The list of potential stumbling blocks to creating a reliable literature review includes possible bias, the inability to access important research papers behind paywalls, and a lack of critical analysis of the references being captured. Advantages, however, may include the ability to quickly sum up the content of previously published studies. AI systems like Elicit even claim they “can also extract insights from different sections of papers — the methods, conclusions and so on.”

To test the ability of AI systems to speed up the evaluation of source material for possible inclusion in a systematic review, Sanne van Dijk and associates used an open-source AI tool called ASReview. They used it to screen the titles and abstracts of thousands of published studies in the hope that it might save authors the many hours required to read every article manually and determine which ones were of high enough quality to include in a systematic review. Training the AI tool to screen all the articles was itself quite labor-intensive: the team used a “researcher-in-the-loop” machine learning algorithm to rank the articles and had to train it to recognize the difference between relevant and irrelevant papers. In the final analysis, van Dijk et al. found that ASReview did save a significant amount of time compared with manual review of the selected source material. Whether the time saved outweighed the time invested in training the algorithm, however, is debatable.
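ASReview’s actual pipeline is more elaborate than this, but the “researcher-in-the-loop” idea can be sketched in a few lines: train a classifier on the labels supplied so far, show the reviewer the unlabeled abstract the model thinks is most likely to be relevant, record the reviewer’s judgment, and repeat. The abstracts, hidden labels, and simulated reviewer below are hypothetical, and logistic regression simply stands in for whatever classifier the tool is configured to use.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical abstracts with hidden ground-truth labels (1 = relevant).
# In a real screening session the "oracle" is the human reviewer.
abstracts = [
    "Deep learning for screening titles in systematic reviews.",
    "Active learning reduces workload in citation screening.",
    "A survey of grape cultivation in southern Europe.",
    "Machine learning to prioritize abstracts for reviewers.",
    "Annual report on municipal water infrastructure.",
    "Crop rotation strategies for vineyard soil health.",
]
truth = np.array([1, 1, 0, 1, 0, 0])

X = TfidfVectorizer().fit_transform(abstracts)

# Seed with one known-relevant and one known-irrelevant record, the kind of
# prior labeling the reviewer must supply before screening can start.
labeled = {0: 1, 2: 0}

while len(labeled) < len(abstracts):
    idx = sorted(labeled)
    model = LogisticRegression().fit(X[idx], [labeled[i] for i in idx])

    # Show the reviewer the unlabeled record the model currently ranks
    # as most likely to be relevant.
    unlabeled = [i for i in range(len(abstracts)) if i not in labeled]
    probs = model.predict_proba(X[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmax(probs))]

    labeled[pick] = int(truth[pick])  # the human's judgment, simulated here
    print(f"reviewer screens record {pick}: label={labeled[pick]}")
```

The time savings come from the ordering: relevant records tend to surface early, so the reviewer can stop screening once new picks are consistently irrelevant. The up-front cost the van Dijk team describes, preparing the seed labels and tuning the model, sits outside this loop.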

Although the technology is evolving, there is currently no digital tool that can automatically generate reviews of the medical literature with the same quality and accuracy as those assembled by humans. In the short term, we’ll see tools that augment human authors’ capabilities; in the long term, we may see fully automated methods. We’ll keep you posted as the state of the art moves forward.

