Results Summary
What was the project about?
Patients get care at different places, such as doctors’ offices, hospitals, and pharmacies. Linking patients’ data from these places can help researchers get complete information for each patient for use in studying patient health.
Researchers can use two types of methods to help link patient records based on personal data like names and dates of birth. The first type, known as deterministic, is simpler to use. The second type, known as probabilistic, uses complex computer programs that take longer to work and haven’t been studied as much. But when errors exist in the data, probabilistic methods may work better.
In this project, the research team wanted to learn more about the accuracy of deterministic and probabilistic methods for linking patient records.
What did the research team do?
The research team used four data sets of linked records:
- Newborn screening registry records linked to hospital and clinic health records
- Patient records from different hospitals
- Patient records from public health registries
- Patient death records linked to health records
The research team manually reviewed some records from each data set to check if they were linked correctly. Using only the records that were linked correctly, the team created four reference data sets.
Then, the research team used each linkage method to link individual patient records from the reference data sets one more time. The team compared the linkage results from each method to the reference data sets to see which method was more accurate.
Patients, a patient representative, and other community members helped design the study.
What were the results?
Overall, the probabilistic method was more accurate than the deterministic method. It led to better accuracy in all four data sets.
What were the limits of the project?
The research team didn’t check if one method worked better than the other across different racial and ethnic groups.
Future research could look at how the methods work with records from patients of different racial and ethnic backgrounds.
How can people use the results?
Researchers can use probabilistic linkage methods to link patient records.
Professional Abstract
Background
Patients receive care at different locations, such as doctors’ offices, hospitals, and pharmacies, each of which creates and stores patient health data differently. To obtain comprehensive information on patient healthcare experiences, researchers can use different methods to link patient records based on individual demographic data. Key-based deterministic methods require a strict match between demographic data fields across data sources to accurately link patient data. Probabilistic methods are more flexible, allowing linkage even when discrepancies exist in individual patient data between data sets. But probabilistic methods have not been tested extensively with real-world healthcare data.
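To make the strict-match requirement concrete, the following is a minimal sketch of key-based deterministic linkage, not the research team's implementation. The field names, the SHA-256 hashing scheme, and the sample records are all assumptions chosen for illustration.

```python
import hashlib

def link_key(record, fields):
    """Build a hashed key from normalized field values (hypothetical scheme)."""
    raw = "|".join(record[f].strip().lower() for f in fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def deterministic_link(left, right, fields):
    """Return (left_id, right_id) pairs whose hashed keys match exactly."""
    index = {}
    for rec in right:
        index.setdefault(link_key(rec, fields), []).append(rec["id"])
    return [
        (rec["id"], right_id)
        for rec in left
        for right_id in index.get(link_key(rec, fields), [])
    ]

# Made-up records; "Jon" vs. "John" is a deliberate data-entry discrepancy.
hospital = [
    {"id": "A1", "first": "Maria", "last": "Lopez", "dob": "1984-03-02"},
    {"id": "A2", "first": "John", "last": "Smith", "dob": "1990-11-15"},
]
registry = [
    {"id": "B1", "first": "maria", "last": "Lopez ", "dob": "1984-03-02"},
    {"id": "B2", "first": "Jon", "last": "Smith", "dob": "1990-11-15"},
]

links = deterministic_link(hospital, registry, ["first", "last", "dob"])
```

In this sketch, A1 and B1 link because normalization removes the case and whitespace differences, while A2 and B2 do not link because a one-letter difference in the first name changes the whole key.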
Objective
To evaluate and compare probabilistic record linkage methods and key-based deterministic methods for linking patient data
Study Design
Design Element | Description |
---|---|
Design | Empirical analysis |
Data Sources and Data Sets | |
Analytic Approach | |
Outcomes | Linkage performance metrics: sensitivity, positive predictive value, F-score |
Methods
The research team first identified four data sets of linked records:
- Newborn screening records linked to the Indiana Network for Patient Care (INPC)
- Linked hospital registries that share overlapping patient populations
- Linked public health registry data from public health services and lab data
- Death records linked to INPC
The research team randomly selected linked records from the four data sets. They manually reviewed the linked records to confirm they were accurate. Using only accurately linked records, the team created four gold standard data sets that represented four use cases for evaluating deterministic and probabilistic linkage methods: identifying unscreened newborns, linking records from different hospital registries, deduplicating records in a public health registry, and determining death status.
Then, using deterministic and probabilistic linkage methods, the research team linked the same sets of patient records in each use case. The deterministic method used combinations of encrypted identifiers to protect patient privacy. The probabilistic method used the Fellegi-Sunter algorithm to select fields for linking de-identified records. In each use case, the team compared patient records linked through both methods with the gold standard data sets. They compared the sensitivity, positive predictive values, and F-scores of the two methods in each of the four use cases.
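As an illustration of how Fellegi-Sunter scoring generally works, the sketch below sums a log-likelihood weight for each field and classifies the pair against two thresholds. It is a simplified sketch under stated assumptions, not the team's configuration: the m- and u-probabilities, the thresholds, the field names, and the records are hypothetical, and plain-text values are used for readability where the study used de-identified records.

```python
import math

# Hypothetical parameters per field:
#   m = P(field agrees | records belong to the same patient)
#   u = P(field agrees | records belong to different patients)
PARAMS = {
    "first": (0.90, 0.01),
    "last": (0.95, 0.005),
    "dob": (0.98, 0.003),
}

def match_weight(rec_a, rec_b, params=PARAMS):
    """Sum the Fellegi-Sunter agreement and disagreement weights."""
    total = 0.0
    for field, (m, u) in params.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

def classify(weight, upper=10.0, lower=0.0):
    """Classify a pair using two hypothetical thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

rec_a = {"first": "john", "last": "smith", "dob": "1990-11-15"}
rec_b = {"first": "jon", "last": "smith", "dob": "1990-11-15"}   # typo in first name
rec_c = {"first": "ana", "last": "silva", "dob": "1975-06-30"}   # unrelated person

typo_pair = classify(match_weight(rec_a, rec_b))
unrelated_pair = classify(match_weight(rec_a, rec_c))
```

With these assumed parameters, the pair with the misspelled first name still classifies as a link because agreement on last name and date of birth outweighs the single disagreement, while the unrelated pair classifies as a non-link.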
Patients, a patient representative, and community members helped design the study.
Results
Overall, the probabilistic linkage method was more accurate than the deterministic method. The probabilistic method showed better sensitivity and F-scores in all four use cases, with the highest scores in the newborn screening and public health registry use cases. The deterministic linkage method produced higher positive predictive values in the newborn screening and public health registry use cases.
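The three metrics can be computed by comparing a method's linked pairs with a gold standard set of pairs. The sketch below assumes the F-score is the harmonic mean of sensitivity and positive predictive value (the common F1 definition); the pair identifiers are made up and are not study data.

```python
def linkage_metrics(predicted, gold):
    """Compare predicted record pairs with gold standard pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # correct links found
    fp = len(predicted - gold)   # links made in error
    fn = len(gold - predicted)   # true links missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    # Assumption: F-score taken as the harmonic mean of the two (F1).
    f_score = (
        2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    )
    return {"sensitivity": sensitivity, "ppv": ppv, "f_score": f_score}

# Illustrative pairs only.
gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}
strict_links = {("A1", "B1"), ("A2", "B2")}
flexible_links = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B5")}

strict = linkage_metrics(strict_links, gold)
flexible = linkage_metrics(flexible_links, gold)
```

In this toy example the strict method makes no false links (positive predictive value of 1.0) but finds only half of the true links, while the flexible method finds more true links at the cost of one false link. This mirrors the kind of trade-off between sensitivity and positive predictive value that the results describe.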
Limitations
The probabilistic linkage method requires more computation time and effort than the deterministic method. The research team did not evaluate the methods for algorithmic bias that may affect linkage results for different racial and ethnic groups.
Conclusions and Relevance
Incorporating probabilistic methods can improve record linkage accuracy for de-identified data that preserves patient privacy.
Future Research Needs
Future research could continue to study probabilistic linkage results for records from different racial and ethnic groups.
Final Research Report
This project's final research report is expected to be available by February 2024.
Peer-Review Summary
Peer review of PCORI-funded research helps make sure the report presents complete, balanced, and useful information about the research. It also assesses how the project addressed PCORI’s Methodology Standards. During peer review, experts read a draft report of the research and provide comments about the report. These experts may include a scientist focused on the research topic, a specialist in research methods, a patient or caregiver, and a healthcare professional. These reviewers cannot have conflicts of interest with the study.
The peer reviewers point out where the draft report may need revision. For example, they may suggest ways to improve descriptions of the conduct of the study or to clarify the connection between results and conclusions. Sometimes, awardees revise their draft reports twice or more to address all of the reviewers’ comments.
Peer reviewers commented, and the researchers made changes or provided responses. Those comments and responses included the following:
- The reviewers noted that while this methods-focused project was executed well, the report was hard to follow and contained a number of concepts and statistical approaches that would not be known to the average scientific audience. The researchers revised their report to include more summaries and examples to better describe the research to scientists who were not experts in this field.
- The reviewers requested additional information on the engagement activities for this highly technical project. In particular, they asked the researchers to provide more detail about the focus groups they employed to better understand the aspects of privacy-preserving record linkage (PPRL) that patients found most concerning and most important, and to help develop educational materials that could inform other patients about PPRL. The researchers added details about the topics discussed at the focus groups and how they used the findings from these groups.