Knowledge Graphs Can Move Healthcare into the Future
The term knowledge graph may not be part of your everyday vocabulary, but if you are interested in digital health, it’s worth taking a closer look.
By John Halamka, M.D., President, Mayo Clinic Platform and Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform.
IBM defines a knowledge graph (KG) as “a network of real-world entities—i.e. objects, events, situations, or concepts—and illustrates the relationship between them. This information is usually stored in a graph database and visualized as a graph structure, prompting the term knowledge graph.” For many clinicians, that’s a somewhat cryptic explanation, which is part of the reason physicians and informatics specialists have a hard time communicating. They speak different languages. A concrete example can help clarify the definition.
In medicine, several domains exist to help researchers and clinicians understand and manage specific diseases. They fall into several large "buckets," including genomics, transcriptomics, proteomics, molecular biology, the exposome, and therapeutics. This fragmentation makes it difficult to develop a personalized approach to patient care. Payal Chandak and associates highlight the problem, pointing out that: "A resource that comprehensively describes the relationships of diseases to biomedical entities would enable systematic study of human disease. Understanding the connections between diseases, drugs, phenotypes, and other entities could open the doors for many types of research, including but not limited to the study of phenotyping, disease etiology, disease similarity, diagnosis, treatments, drug-disease relationships, mechanisms of drug action, and resistance, drug repurposing, drug discovery, adverse events, and combination therapies." In this context, the "entities" referred to above are all the concepts within these domains, which are folded into a network that enables users to literally visualize the relationships between them.
The basic structure of a KG consists of nodes, edges, and labels. Each connection can be expressed as a triple of subject, predicate, and object: for example, <Bob, is_interested_in, the_Mona_Lisa>. In a network designed to bring together all the disparate domains needed to create a personalized medicine database, node types can include phenotypes, exposures, and drugs, and the edges can represent a variety of relationships between these nodes. As an example, Chandak et al. illustrate the numerous relationships between autism and the drug risperidone, including drug indications, the drug's molecular target, and contraindications among autistic patients with epilepsy. Their KG, called PrimeKG, drew from numerous primary data resources, including Mayo Clinic's knowledge base; the Disease Gene Network (DisGeNET), which consists of gene/disease associations; DrugBank; the MONDO disease ontology, which was used to define diseases; and Orphanet, a source of data on rare diseases.
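The triple structure described above can be sketched in a few lines of code. This is a minimal illustration of a triple store, not any production graph database; the biomedical entity and relation names (e.g. `indicated_for`, `targets`) are hypothetical placeholders loosely modeled on the risperidone example.

```python
from collections import defaultdict

class KnowledgeGraph:
    """A toy triple store: each fact is a (subject, predicate, object) triple."""

    def __init__(self):
        # Index (subject, predicate) -> set of objects for fast lookup.
        self._index = defaultdict(set)
        self.triples = []

    def add(self, subject, predicate, obj):
        """Store one (subject, predicate, object) triple."""
        self.triples.append((subject, predicate, obj))
        self._index[(subject, predicate)].add(obj)

    def objects(self, subject, predicate):
        """Return all objects linked to a subject by a given predicate."""
        return sorted(self._index[(subject, predicate)])

# Illustrative triples only -- not a real clinical knowledge base.
kg = KnowledgeGraph()
kg.add("risperidone", "indicated_for", "autism")
kg.add("risperidone", "targets", "DRD2")
kg.add("risperidone", "contraindicated_with", "epilepsy")

print(kg.objects("risperidone", "indicated_for"))  # ['autism']
```

Real systems such as PrimeKG store millions of such triples, but the underlying idea is the same: every edge in the graph is one subject-predicate-object statement.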
Since developing KGs of this nature requires considerable time and manpower, it's only natural to wonder what kind of return on investment we might expect from such initiatives. In the case of PrimeKG, investigators were able to perform sophisticated, AI-powered data analytics that would not have been possible any other way. More specifically, among 40 recently FDA-approved drugs studied, researchers identified 11 that could be repurposed.
There is also evidence to suggest that comprehensive, finely tuned KGs can play a role in clinical decision support systems, improving workflow and patient care. Currently, there is a great deal of untapped value in electronic health records: they can be used to identify patterns in patients' medical histories and make more accurate predictions about their clinical course. But much of this information is trapped in narrative notes and other hard-to-access formats. KGs are being developed to help structure EHR data so it can be fed more effectively into artificial neural networks and transformer-based algorithms, which in turn can power predictive analytics systems.
Similarly, KGs exist to help extract diagnosis and procedure codes to generate more accurate ICD coding. Cui et al. explain, for instance, that it’s possible to “inject the label information via structured knowledge graph propagation by leveraging graph convolution networks to learn the correlations among medical codes.” KGs are also being used to improve clinical report summaries and treatment recommendations. Their ability to identify drug/drug interactions alone can have a measurable impact on patient outcomes.
KGs have a special role to play in the world of large language models (LLMs) as well. For instance, they can serve as the search component in retrieval-augmented generation (RAG). The LLM takes in the user's question and, under prompt instructions, outputs a structured KG query, typically as a JSON object with appropriate fields. The KG engine responds with the matching triples, also as JSON, and the LLM then reformats those triples into a factually grounded, natural-language answer rather than returning them in their raw, cryptic form. To illustrate, imagine a clinician asking an LLM, "What are the drugs indicated for non-small-cell lung cancer?" A KG may contain triples of the form <non-small-cell lung cancer, treated_by, drug1>, <non-small-cell lung cancer, treated_by, drug2>, ... <non-small-cell lung cancer, treated_by, drug n>. The LLM formats these triples into the user-friendly response: "Non-small-cell lung cancer is treated by drug1, drug2, ... drug n."
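The retrieval loop just described can be sketched as three steps: the LLM emits a JSON query, the KG engine answers with triples as JSON, and the LLM rephrases the triples for the user. In the sketch below the LLM calls are stubbed out with hard-coded behavior, and the query format, triple store, and drug names are all illustrative placeholders, not a real clinical knowledge base or any particular vendor's API.

```python
import json

# Toy triple store for one disease (placeholder drug names).
TRIPLES = [
    ("non-small-cell lung cancer", "treated_by", "drug1"),
    ("non-small-cell lung cancer", "treated_by", "drug2"),
    ("non-small-cell lung cancer", "treated_by", "drug3"),
]

def llm_to_kg_query(user_question: str) -> str:
    # Step 1: in practice an LLM, prompted to emit structured output,
    # would translate the question into this JSON query. Hard-coded here.
    return json.dumps({"subject": "non-small-cell lung cancer",
                       "predicate": "treated_by"})

def kg_engine(query_json: str) -> str:
    # Step 2: the KG engine returns the matching triples, also as JSON.
    q = json.loads(query_json)
    objs = [o for s, p, o in TRIPLES
            if s == q["subject"] and p == q["predicate"]]
    return json.dumps({"objects": objs})

def format_response(kg_json: str) -> str:
    # Step 3: the LLM would rephrase the raw triples as a natural sentence.
    objs = json.loads(kg_json)["objects"]
    return f"Non-small-cell lung cancer is treated by {', '.join(objs)}."

question = "What are the drugs indicated for non-small-cell lung cancer?"
answer = format_response(kg_engine(llm_to_kg_query(question)))
print(answer)  # Non-small-cell lung cancer is treated by drug1, drug2, drug3.
```

The key design point is that the factual content comes from the graph, not the model's parameters; the LLM only handles translation between natural language and the structured query and response.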
Knowledge graphs may not be part of the everyday conversation among stakeholders in digital health, but they have the potential to transform patient care and clinical research.