Identifying the Best De-Identification Protocols
Keeping patient data private remains one of the biggest challenges in healthcare. A recently developed algorithm from nference is helping address the problem.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
In the United States, healthcare organizations that manage or store personal health information (PHI) are required by law to keep that data secure and private. Ignoring that law, as spelled out in the HIPAA regulations, has cost several providers and insurers millions of dollars in fines, and serious damage to their reputations. HIPAA offers 2 acceptable ways to keep PHI safe: Certification by a recognized expert and the Safe Harbor approach, which requires organizations to hide 18 identifiers in patient records so that unauthorized users cannot identify patients. At Mayo Clinic, however, we believe we must do more.
In partnership with the data analytics firm nference, we have developed a de-identification approach that takes patient privacy to the next level, using a protocol on EHR clinical notes that includes attention-based deep learning models, rule-based methods, and heuristics. Murugadoss et al explain that “rule-based systems use pattern matching rules, regular expressions, and dictionary and public database look-ups to identify PII [personally identifiable information] elements.” The problem with relying solely on such rules is they miss things, especially in an EHR’s narrative notes, which often use non-standard expressions, including unusual spellings, typographic errors and the like. Such rules also consume a great deal of time to manually create. Similarly, traditional machine learning based systems, which may rely on support vector machine or conditional random fields, have their shortcomings and tend to remain reliable across data sets.
The ensemble approach used at Mayo includes a next generation algorithm that incorporates natural language processing and machine learning. Upon detection of PHI, the system transforms detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from Mayo Clinic. We compared our approach with other existing tools considered best-in-class. The results indicated a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively.
While this protocol has many advantages over older systems, it’s only one component of a more comprehensive system used at Mayo to keep patient data private and secure. Experience has shown us that de-identified PHI, once released to the public, can sometimes be re-identified if a bad actor decides to compare these records to other publicly available data sets. There may be obscure variants within the data that humans can interpret as PHI but algorithms will not. For example, a computer algorithm expects phone numbers to be in the form area code, prefix, suffice i.e. (800) 555-1212. What if a phone number is manually recorded into a note as 80055 51212? A human might dial that number to re-identify the record. Further we expect dates to be in the form mm/dd/yyyy. What if a date of birth is manually typed into a note as 2104Febr (meaning 02/04/2021)? An algorithm might miss that.
With these risks in mind, Mayo Clinic is using a multi-layered defense referred to as data behind glass. The concept of data behind glass is that the de-identified data is stored in an encrypted container, always under control of Mayo Clinic Cloud. Authorized cloud sub-tenants can be granted access such that their tools can access the de-identified data for algorithm development, but no data can be taken out of the container. This prevents prevents merging the data with other external data sources.
At Mayo Clinic, the patient always comes first, so we have committed to continuously adopt novel technologies that keep information private.
Recent Posts
By John Halamka, Paul Cerrato, and Teresa Atkinson — Many clinicians are well aware of the shortcomings of LLMs, but studies suggest that retrieval-augmented generation could help address these problems.
By John Halamka and Paul Cerrato — Large language models rely on complex technology, but a plain English tutorial makes it clear that they use math, not magic to render their impressive results.
By John Halamka and Paul Cerrato — Many algorithms only reinforce a person’s narrow point of view, or encourage existing prejudices. There are better alternatives.