How to Interpret Medical Research
Our primer will give you many of the cognitive skills needed to understand the latest research on digital health.

By Paul Cerrato, MA, senior research analyst and communications specialist, and John Halamka, M.D., M.S., Diercks President, Mayo Clinic Platform
One of the things that separates medical research from content found on many social media sites is its emphasis on critical thinking. While many popular sites will post hearsay and conjecture, research journals rely on hypothesis testing, statistical analysis, and review by experts. It’s only natural for many readers to gravitate toward social media because it’s much easier to read—and because it often agrees with their preconceived ideas about how the world works. Struggling through the technical language in a medical or scientific paper requires much more effort—and the courage to challenge one’s own deeply held beliefs. The purpose of this primer is to explain many of the concepts and terms in the medical literature to make the journey easier.
Before we dive into terms like sensitivity, specificity, area under the curve, P values, statistical significance, and positive predictive value, it is helpful to understand the types of journal articles. Among the “buckets” you’re likely to encounter are observational and interventional studies, retrospective and prospective studies, case control and cohort studies, and randomized controlled clinical trials. The two broad categories are observational and interventional. As the name implies, during an observational study researchers only watch and learn; they don’t add or subtract anything like a medication or dietary changes. Case control studies fall into this category. Interventional studies, on the other hand, involve a researcher adding some sort of treatment into the mix. They might be open clinical experiments, in which the patients know what they are receiving, or a blind trial, in which they do not. In a randomized double-blind trial, neither the patient nor the investigators know what treatment is being given. That reduces the likelihood of either group being influenced by the expectation of what effect the intervention may have.
Case control studies are observational and retrospective, which means they look back in time to see what has happened to a group of patients (the cases) and compare it to what has happened to a similar group of healthy persons (the controls). They might, for example, compare patients with lung cancer to those without to see whether their intake of a certain nutrient differed. Cohort studies, finally, are prospective and observational: they follow a group forward in time to see who develops the condition of interest.
Another difference between research studies and many popular online articles is that researchers typically begin with hypothesis testing. If they have a hunch about what causes a disease or how to prevent it, they see the need to test that hunch. They begin with what’s referred to as the null hypothesis, which assumes there is no real effect; the burden is on the investigator to produce strong enough evidence to reject it. If, for example, you find a difference in the rate of diarrhea between patients who take antibiotic X and those who take antibiotic Y, and the difference is large enough, it is labeled statistically significant. A P value of 0.05 indicates that the probability of seeing a difference this large by chance alone is 1 in 20.
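To make the arithmetic concrete, here is a minimal sketch of one common way to test a difference between two rates, a two-proportion z-test. The drug names are from the example above, but the patient counts are invented for illustration:

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = events_a / n_a, events_b / n_b
    p_pool = (events_a + events_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided P value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical data: 30 of 200 patients on antibiotic X report diarrhea,
# versus 15 of 200 on antibiotic Y
z, p = two_proportion_z_test(30, 200, 15, 200)
print(round(z, 2), round(p, 3))
```

Because the P value comes out below 0.05, the difference between the two hypothetical groups would be labeled statistically significant; with smaller samples the same rates might not be.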
Sensitivity and specificity are especially important terms to understand in the context of AI algorithms. Sensitivity is the proportion of true positives: the share of patients with the disease that the test correctly identifies. If, for example, a model claims it can accurately diagnose melanoma, it has to be tested against the diagnostic skills of expert dermatologists, referred to as ground truth. If the algorithm detects 8 out of 10 skin cancers diagnosed by the clinicians, its sensitivity is 0.8, which also means it is telling two patients with the disease that they don’t have it. Specificity is the proportion of true negatives: a rating of 0.9 means the test correctly identifies 9 out of 10 healthy patients as being free of the cancer.
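The two calculations can be sketched directly from the counts in the melanoma example:

```python
def sensitivity(tp, fn):
    """True positive rate: diseased patients the test correctly flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: healthy patients the test correctly clears."""
    return tn / (tn + fp)

# The article's example: the model finds 8 of 10 cancers (2 missed),
# and clears 9 of 10 healthy patients (1 false alarm)
print(sensitivity(tp=8, fn=2))   # 0.8
print(specificity(tn=9, fp=1))   # 0.9
```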
Although sensitivity and specificity are useful statistics, they tell only part of the story about a diagnostic test’s value. Take the skin cancer algorithm: with a sensitivity of 0.8 we know its true positive and false negative rates, but we don’t know how many of its positive results are false, which is even more important for a clinician deciding whether to use it. False positives are patients who are told they have the disease but don’t. In a typical medical practice, the physician wants to know what a positive result actually means for the patient in front of them. The number of true positives divided by the total number of true positive and false positive results yields the positive predictive value (PPV), a more helpful statistic when making clinical decisions. Put another way, PPV is the proportion of patients with a positive test who actually have the disease.
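PPV depends not only on sensitivity and specificity but also on how common the disease is in the population being tested. A short sketch, keeping the 0.8 sensitivity and 0.9 specificity from the example and plugging in two hypothetical prevalence values:

```python
def ppv(sens, spec, prevalence):
    """Positive predictive value for a given disease prevalence."""
    tp = sens * prevalence              # true positives per patient screened
    fp = (1 - spec) * (1 - prevalence)  # false positives per patient screened
    return tp / (tp + fp)

# Same test, two hypothetical populations
print(round(ppv(0.8, 0.9, 0.50), 2))  # 0.89 when half the patients are diseased
print(round(ppv(0.8, 0.9, 0.01), 2))  # 0.07 when the disease is rare
```

The drop is why a screening test that performs well in a specialist clinic can generate mostly false alarms when applied to the general population.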
Another cryptic term you will likely encounter when reading research papers is “area under the curve” (AUC), the area under the receiver operating characteristic (ROC) curve. The need for this statistic arises because sensitivity and specificity tend to work against one another: one goes up when the other goes down. To deal with this trade-off, the ROC curve plots sensitivity on the y axis against 1 minus specificity on the x axis at various cutoff points, and the AUC combines the two into one number summarizing accuracy across all those cutoffs. When the AUC is 0.8 or above, the test is usually considered valuable.
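A minimal sketch of how AUC can be computed, using the equivalent rank interpretation (AUC is the probability that a randomly chosen diseased case scores higher than a randomly chosen healthy one). The scores and labels below are invented for illustration:

```python
def roc_auc(scores, labels):
    """AUC via the rank interpretation: the fraction of diseased/healthy pairs
    in which the diseased case scores higher (ties counted as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores; 1 = disease present, 0 = absent
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(roc_auc(scores, labels))  # 0.875
```

A model that ranked every diseased case above every healthy one would score 1.0; a model guessing at random would hover around 0.5.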
If you’ve spent any time studying the biomedical literature, you know that these reports are not infallible. Nonetheless, you’re far more likely to learn the truth from these sources than from the typical social media site.
This brief look at biostatistics only skims the surface. We hope to discuss other basic concepts in future columns.