Results Summary

What was the project about?

Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person’s known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients.

Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It’s even harder when data are missing due to a patient’s health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes.

In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes.

What did the research team do?

The research team developed the new method. The method combined multiple types of EHR data, such as lab results, diagnostic codes, and patient age and gender, to better estimate patient phenotypes. It also used the pattern of missing data for each patient, along with the data that were available.
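To show why the pattern of missing data can carry information, here is a minimal sketch of a Bayes-rule calculation. It is a toy illustration, not the research team's model. The function name, the starting probability of 0.3, and the chances of a lab test being ordered (0.9 and 0.3) are all made-up assumptions, and the sketch assumes each piece of evidence is independent once the patient's true status is known.

```python
def posterior_probability(prior, evidence):
    """Update the probability that a patient has a condition.

    prior: probability of the condition before looking at any evidence.
    evidence: list of (P(observation | condition), P(observation | no condition)).
    Assumes the observations are independent given the true status.
    """
    weight_condition = prior
    weight_no_condition = 1.0 - prior
    for p_if_condition, p_if_no_condition in evidence:
        weight_condition *= p_if_condition
        weight_no_condition *= p_if_no_condition
    return weight_condition / (weight_condition + weight_no_condition)


# Hypothetical numbers: 30% of patients have the condition. A lab test is
# ordered for 90% of patients with it and 30% of patients without it.
PRIOR = 0.3
LAB_MEASURED = (0.9, 0.3)      # the lab value is present in the record
LAB_NOT_MEASURED = (0.1, 0.7)  # the lab value is missing from the record

if __name__ == "__main__":
    print("lab measured:", posterior_probability(PRIOR, [LAB_MEASURED]))
    print("lab missing: ", posterior_probability(PRIOR, [LAB_NOT_MEASURED]))
```

Under these assumed numbers, simply knowing that a lab test was ordered raises the estimated probability of the condition, and knowing that it was not ordered lowers it. More evidence, such as a diagnostic code, could be added as another pair in the list.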

Next, the research team tested the method using data created by a computer program. The data looked like real EHR data for 1,000 patients at risk for type 2 diabetes. The team created two types of missing data: data missing by chance and data missing due to a patient’s health status. Then the team looked at how well the method worked compared with four other methods that estimate patient phenotypes.
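The two types of missing data described above can be sketched with a short simulation. This is an illustration only and does not reproduce the research team's data or program. The function names, the share of patients with diabetes (0.3), the lab value ranges, and the chances of a value being missing are hypothetical choices made for this example.

```python
import random


def simulate_patients(n=1000, prevalence=0.3, seed=0):
    """Create made-up patients with a true diabetes status and one lab value."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        has_diabetes = rng.random() < prevalence
        # Illustrative lab values only; not taken from the study.
        lab_value = rng.gauss(7.5, 1.0) if has_diabetes else rng.gauss(5.4, 0.4)
        patients.append({"diabetes": has_diabetes, "lab": lab_value})
    return patients


def remove_by_chance(patients, p_missing=0.4, seed=1):
    """Every patient has the same chance of a missing lab value."""
    rng = random.Random(seed)
    return [dict(p, lab=None) if rng.random() < p_missing else dict(p)
            for p in patients]


def remove_by_health_status(patients, p_missing_if_diabetes=0.1,
                            p_missing_if_no_diabetes=0.7, seed=1):
    """The chance of a missing lab value depends on the patient's true status."""
    rng = random.Random(seed)
    masked = []
    for p in patients:
        p_missing = p_missing_if_diabetes if p["diabetes"] else p_missing_if_no_diabetes
        masked.append(dict(p, lab=None) if rng.random() < p_missing else dict(p))
    return masked


def missing_rate(patients, diabetes):
    """Share of patients in one group (diabetes or not) with a missing lab value."""
    group = [p for p in patients if p["diabetes"] == diabetes]
    return sum(p["lab"] is None for p in group) / len(group)


if __name__ == "__main__":
    patients = simulate_patients()
    for name, data in [("by chance", remove_by_chance(patients)),
                       ("by health status", remove_by_health_status(patients))]:
        print(name, round(missing_rate(data, True), 2),
              round(missing_rate(data, False), 2))
```

When data are missing by chance, the two groups have similar shares of missing lab values. When data are missing due to health status, the shares differ, so a method that ignores the missing pattern loses information about who has the condition.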

Doctors, patients, and caregivers helped design the study.

What were the results?

The new method estimated phenotypes more accurately than the other methods. It was 99 percent accurate, compared with 94 percent for the next best method.

What were the limits of the project?

The project applied the method to data that met a specific set of conditions. Results may differ for data that don’t meet these conditions.

Future studies can test the method with data on phenotypes that change over time, such as cancer stage.

How can people use the results?

Researchers can use the new method to identify patients with specific health problems for research, even when the EHR has missing data.

Final Research Report

View this project's final research report.

Peer-Review Summary

Peer review of PCORI-funded research helps make sure the report presents complete, balanced, and useful information about the research. It also assesses how the project addressed PCORI’s Methodology Standards. During peer review, experts read a draft report of the research and provide comments about the report. These experts may include a scientist focused on the research topic, a specialist in research methods, a patient or caregiver, and a healthcare professional. These reviewers cannot have conflicts of interest with the study.

The peer reviewers point out where the draft report may need revision. For example, they may suggest ways to improve descriptions of the conduct of the study or to clarify the connection between results and conclusions. Sometimes, awardees revise their draft reports twice or more to address all of the reviewers’ comments. 

Peer reviewers commented, and the researchers made changes or provided responses. Those comments and responses included the following:

  • The reviewers lauded the researchers’ work to better predict patient diagnoses and other traits while accounting for missing information in the electronic health record. The reviewers asked that the researchers provide more information about the reasoning behind the Bayesian approaches developed in this study. The researchers responded by adding results of several comparable methods to their report to demonstrate the superiority of the Bayesian methods.
  • The reviewers asked how useful this work would be for clinical research. The researchers responded that the results of this study, and simulation studies like the ones completed in this research, have the potential to be very useful for clinical research. They added that their approach must first be compared with a gold standard, though they noted that none currently exists.

Project Information

Rebecca Hubbard, PhD
University of Pennsylvania
Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records

Key Dates

July 2016
July 2021

Last updated: April 4, 2024