Results Summary
What was the project about?
Researchers often have trouble collecting complete information on patient health, as patients may receive care at different places. Linking patient records from different places may help researchers get a more complete picture.
One way to link records is through personal information, such as names and birth dates. But this method increases risks to patient privacy. Another way, known as privacy-preserving record linkage, or PPRL, masks personal information. But current PPRL methods only work when linking entire sets of patient data, including data that have already been shared and linked. Linking entire data sets takes a long time. Also, sharing the same records multiple times increases data privacy risks.
In this study, the research team developed and tested a new PPRL method called incremental PPRL. This method links only new or updated data rather than re-linking entire data sets.
What did the research team do?
The research team developed the new PPRL method based on current methods. They then tested the new method by using it to link test data sets they created. Next, the research team used real patient data to look at how well the new method performed compared to two current record linkage methods that link entire data sets.
The real patient data the research team used included records from 2011 to 2013 from five health systems in the Colorado Congenital Heart Disease registry. The team first linked records for 4,940 patients ages 11–64. They carefully reviewed the linked records to see if they were accurate. Then the team linked the same records using the new method and the two current methods. They compared the linked records from the new and current methods with the data set they had reviewed to see how well each method worked.
Patients, a patient representative, and other researchers gave input on the study.
What were the results?
The new method performed as well as the two current methods in linking patient records. All methods accurately linked records about 97 percent of the time.
The research team made their computer program for the new method available online for free.
What were the limits of the project?
Health systems had a hard time pulling only new or updated records from their data sets to use with the new method. Also, when the team used the new method with large data sets, it was less efficient. The research team only tested the new method with data from one state and one health problem.
Future research could test the new method with patient data for other states and health problems.
How can people use the results?
Researchers can reduce privacy risks by using the new method to link new or updated records with existing data sets.
Professional Abstract
Background
Researchers often have difficulty accessing patients’ entire health histories because health data may be spread over multiple health systems. To improve data completeness and increase data accuracy, researchers link records from different data sources. One method for linking records is to share personally identifiable information, but this method increases risks to patient privacy. Privacy-preserving record linkage (PPRL) methods mitigate this issue by linking encrypted personal information to create patient identifiers that mask personal information. Existing PPRL methods require linking entire sets of patient records instead of only new or updated records. But re-linking entire data sets is inefficient and increases the risk of exposing sensitive patient data.
In this project, the research team wanted to develop a new PPRL method, called incremental PPRL (iPPRL), which links only new or updated records to an existing patient data set.
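The masking idea behind PPRL can be illustrated with a minimal sketch. The source does not say which encryption or encoding scheme the team used, so the keyed hash (HMAC-SHA256) below is an assumption chosen for simplicity, and the names `normalize` and `masked_token` are hypothetical. Each site would derive the same token from the same normalized name and birth date, so records can be compared without exchanging the identifiers themselves.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def masked_token(first: str, last: str, birth_date: str, secret_key: bytes) -> str:
    """Return a keyed hash of the normalized identifiers (illustrative only).

    Assumption: all participating sites share `secret_key` and apply the
    same normalization before hashing.
    """
    message = "|".join(normalize(part) for part in (first, last, birth_date))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"example-shared-key"  # hypothetical key, for demonstration only
    print(masked_token("José", "O'Neil", "1980-01-02", key))
```

An exact keyed hash only matches records whose normalized identifiers agree completely, so it does not tolerate typos or other data quality problems.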
Objective
(1) To develop and implement a novel iPPRL method; (2) to compare iPPRL with existing linkage methods and validate its accuracy and effectiveness.
Study Design
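Design Element | Description |
---|---|
Design | Simulation studies; empirical analysis |
Data Sources and Data Sets | Simulated data set containing 115,000 records; patient records from five health systems in the Colorado Congenital Heart Disease registry |
Analytic Approach | Comparison of linkage results from the iPPRL method and two existing methods against a manually reviewed reference data set |
Outcomes | Linkage performance metrics: precision, recall, F-score |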
Methods
The research team extended existing PPRL methods to develop a new iPPRL method. The method successively linked incremental data sets to an initial data set; linkage ended when no new data could be added. The team applied the iPPRL method to a simulated data set containing 115,000 records that mimicked real-world data quality issues.
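The incremental step described above can be sketched as follows. This is a simplified illustration, not the team's implementation: it assumes each record already carries a masked token and that two records belong to the same patient when their tokens are identical. The source does not describe the matching rule, and the names `link_increment` and `link_incrementally` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (source_record_id, masked_token); both fields are illustrative.
Record = Tuple[str, str]


def link_increment(
    index: Dict[str, int],
    clusters: Dict[int, List[str]],
    batch: Iterable[Record],
) -> None:
    """Link one batch of records into the existing linkage state in place."""
    for record_id, token in batch:
        patient_id = index.get(token)
        if patient_id is None:
            # Token not seen before: start a new linked patient.
            patient_id = len(clusters) + 1
            index[token] = patient_id
            clusters[patient_id] = []
        if record_id not in clusters[patient_id]:
            clusters[patient_id].append(record_id)


def link_incrementally(
    initial: Iterable[Record],
    increments: Iterable[Iterable[Record]],
) -> Dict[int, List[str]]:
    """Link an initial data set, then each incremental data set in turn."""
    index: Dict[str, int] = {}
    clusters: Dict[int, List[str]] = {}
    link_increment(index, clusters, initial)
    for batch in increments:  # Ends when there are no more increments to add.
        link_increment(index, clusters, batch)
    return clusters


if __name__ == "__main__":
    initial = [("A1", "t1"), ("A2", "t2")]
    increments = [[("B1", "t1"), ("B2", "t3")], [("C1", "t3")]]
    print(link_incrementally(initial, increments))
```

Only the records in each increment are processed; earlier records are never re-read, which is the property that distinguishes incremental linkage from re-linking whole data sets.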
Then, using real patient data, the research team compared the performance of the iPPRL method with two existing methods that require re-linking whole data sets. The team first linked data from five health systems in the Colorado Congenital Heart Disease registry. They manually reviewed the linked records to create a reference data set containing 4,940 linked records. Next, the team linked the same records using the iPPRL method and the two existing methods. They compared the linkage results from the iPPRL and existing methods with the reference data set.
Patients, a patient representative, and researchers provided input throughout the study.
Results
The new iPPRL method performed as well as the existing bulk linkage methods. The methods had similar precision (0.99), recall (0.94), and accuracy (0.97).
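For readers unfamiliar with the linkage metrics named in the study design, the sketch below shows one common way to compute precision, recall, and F-score by comparing predicted record pairs with a reference set. The function name `linkage_metrics` and the representation of links as sorted pairs of record IDs are assumptions for illustration; the source does not describe how the team computed its figures.

```python
from typing import Dict, Set, Tuple

# A link is a pair of record IDs, stored in sorted order so that
# ("a", "b") and ("b", "a") are not counted as different links.
Pair = Tuple[str, str]


def linkage_metrics(predicted: Set[Pair], reference: Set[Pair]) -> Dict[str, float]:
    """Compare predicted links with reference links."""
    true_pos = len(predicted & reference)   # links found and correct
    false_pos = len(predicted - reference)  # links found but wrong
    false_neg = len(reference - predicted)  # correct links that were missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


if __name__ == "__main__":
    predicted = {("a", "b"), ("c", "d"), ("e", "f")}
    reference = {("a", "b"), ("c", "d"), ("g", "h"), ("i", "j")}
    print(linkage_metrics(predicted, reference))
```

The F-score is the harmonic mean of precision and recall, so it is high only when a method both avoids false links and finds most of the true ones.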
The research team tested the iPPRL method with synthetic data and made their code available online for free. They incorporated the iPPRL method into an existing web-based record linkage platform called CU Record Linkage.
Limitations
Health systems had challenges extracting incremental data from their systems when the incremental data included both new and existing patients, which was a barrier to implementing the iPPRL method. Large incremental data sets reduced the efficiency of the iPPRL method, making it computationally expensive. The research team tested the iPPRL method using patient data from one state and one health condition. Results may differ with data from other states and health conditions.
Conclusions and Relevance
The new iPPRL method worked as well as existing methods. Researchers can use this method to reduce logistical challenges and protect patient privacy when linking data.
Future Research Needs
Future research could test the method with data from other locations and populations.
Final Research Report
This project's final research report is expected to be available by February 2024.
Peer-Review Summary
Peer review of PCORI-funded research helps make sure the report presents complete, balanced, and useful information about the research. It also assesses how the project addressed PCORI’s Methodology Standards. During peer review, experts read a draft report of the research and provide comments about the report. These experts may include a scientist focused on the research topic, a specialist in research methods, a patient or caregiver, and a healthcare professional. These reviewers cannot have conflicts of interest with the study.
The peer reviewers point out where the draft report may need revision. For example, they may suggest ways to improve descriptions of the conduct of the study or to clarify the connection between results and conclusions. Sometimes, awardees revise their draft reports twice or more to address all of the reviewers’ comments.
Peer reviewers commented and the researchers made changes or provided responses. Those comments and responses included the following:
- The reviewers asked for more information about establishing a gold standard dataset for testing the incremental privacy-preserving record linkage model. The researchers added a paragraph and table to their methods section describing how they created the synthetic dataset that could be used to establish the gold standard. In response to the reviewers’ request for data on the usefulness of the gold standard dataset, the researchers explained that producing such metrics was not possible because the effort required would be too large.
- The reviewers noted the limited generalizability for the methods developed in this study and asked the researchers to provide some examples of how the methods could be used. The researchers added examples of how their data linkage methods could be used in practice.
- Reviewers requested that the researchers revisit their definition for deterministic linkage as a method for linking two medical records. The researchers revised their definition from stating that the deterministic method requires all variables in the two records to match, to stating that it requires the set of variables used for matching to be identical. They went on to describe different examples of linkage variables that could be used in deterministic methods as well as the advantages and disadvantages of this method.
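The revised definition of deterministic linkage in the last comment above can be illustrated with a short sketch: two records match when every variable in the chosen linkage set is identical, regardless of other fields. The function name `deterministic_match` and the field names are hypothetical, and treating missing values as non-matching is an assumption.

```python
from typing import Mapping, Optional, Sequence


def deterministic_match(
    record_a: Mapping[str, Optional[str]],
    record_b: Mapping[str, Optional[str]],
    linkage_vars: Sequence[str],
) -> bool:
    """Return True when all chosen linkage variables are present and identical."""
    return all(
        record_a.get(var) is not None and record_a.get(var) == record_b.get(var)
        for var in linkage_vars
    )


if __name__ == "__main__":
    a = {"last_name": "rivera", "birth_date": "1990-05-17", "sex": "F", "zip": "80202"}
    b = {"last_name": "rivera", "birth_date": "1990-05-17", "sex": "F", "zip": "80014"}

    # Matches on the chosen set of variables even though the ZIP codes differ.
    print(deterministic_match(a, b, ["last_name", "birth_date", "sex"]))
    # Requiring every variable to agree rejects the same pair.
    print(deterministic_match(a, b, ["last_name", "birth_date", "sex", "zip"]))
```

Choosing a smaller linkage set tolerates changes in fields such as address, at the cost of a higher chance of linking two different people who share the chosen values.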
^This project was previously titled: Incremental Privacy-Preserving Record Linkage to Improve Data Quality