Results Summary
What was the project about?
Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person’s known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients.
Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It’s even harder when data are missing due to a patient’s health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes.
In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes.
What did the research team do?
The research team developed a new method that combined multiple types of EHR data, such as lab results, diagnostic codes, and patient age and gender, to better estimate patient phenotypes. The method also combined information on the patterns of missing data with the available data for each patient.
Next, the research team tested the method using data created by a computer program. The data looked like real EHR data for 1,000 patients at risk for type 2 diabetes. The team created two types of missing data: data missing by chance and data missing due to a patient's health status. Then the team compared how well the method worked with four other methods that estimate patient phenotypes.
Doctors, patients, and caregivers helped design the study.
What were the results?
The new method worked better than the other methods to estimate phenotypes. It was 99 percent accurate compared with 94 percent for the next best method.
What were the limits of the project?
The project applied the method to data that met a specific set of conditions. Results may differ for data that don’t meet these conditions.
Future studies can test the method on data for phenotypes that change over time, such as cancer stage.
How can people use the results?
Researchers can use the new method to identify patients with specific health problems for research, even when the EHR has missing data.
Professional Abstract
Background
In comparative effectiveness research, researchers can use data from electronic health records (EHRs) to identify patients based on their observable traits, which is known as phenotyping. Missing data in EHRs due to variations in clinical assessments across patients make it difficult to accurately assign phenotypes. It is especially challenging when data missingness is not random but instead is related to a patient’s health status. For example, patients with uncontrolled diabetes may have more HbA1c lab tests done, while patients with controlled diabetes have missing data for those tests. As a result, patients with controlled diabetes may not be identified as having diabetes in research studies using EHRs.
Existing methods for handling missing data, such as multiple imputation (MI), assume that data are missing at random. When data are not missing at random, using these methods could lead to biased results. A latent variable approach can address gaps in existing methods by combining data that vary in availability across patients. Such methods account for instances when data availability depends on health status, which may improve accuracy in estimating phenotypes from EHRs.
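The bias described above can be illustrated with a small simulation. This is a hypothetical sketch, not the study's data or code: HbA1c values are generated for a population, and low (controlled) values are made more likely to be missing, mimicking a missing-not-at-random mechanism. Averaging only the observed values then overstates the population mean, which is the kind of bias that methods assuming data are missing at random cannot correct.

```python
import random

# Hypothetical illustration (not the study's data): when HbA1c values are
# missing not at random -- here, low (controlled) values are more likely to
# go unmeasured -- the mean of the observed values overstates the true mean.
random.seed(0)

true_hba1c = [random.gauss(7.0, 1.5) for _ in range(100_000)]

# MNAR mechanism (assumed rates for illustration): controlled patients
# (HbA1c < 7) are tested less often, so 70% of their values are missing,
# vs 10% for uncontrolled patients.
observed = [x for x in true_hba1c
            if random.random() > (0.7 if x < 7.0 else 0.1)]

true_mean = sum(true_hba1c) / len(true_hba1c)
obs_mean = sum(observed) / len(observed)

print(f"true mean HbA1c:    {true_mean:.2f}")
print(f"observed-only mean: {obs_mean:.2f}  (biased upward under MNAR)")
```

Because the missingness depends on the value itself, no method that looks only at the observed values, including standard multiple imputation, can recover the true mean without modeling the missingness mechanism.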
Objective
To develop and evaluate a latent class model for estimating phenotypes using EHR data
Study Design
| Design Element | Description |
|---|---|
| Design | Statistical modeling, simulation studies |
| Data Sources and Data Sets | Simulated data resembling a cohort of 1,000 patients at risk for type 2 diabetes |
| Analytic Approach | Bayesian latent class modeling; comparison of performance of the latent phenotype model with different combinations of computable phenotype classifications and multiple imputation |
| Outcomes | Predictive performance of each method: sensitivity, specificity, proportion of patients misclassified relative to the true diagnosis |
Methods
The research team developed a Bayesian latent class model for predicting a patient’s phenotype. The model combined information on data availability and observed data values for each patient to estimate a latent, or unobserved, phenotype. The model assumed that the latent phenotype was correlated with model covariates, like biomarkers, clinical diagnosis codes, prescription medications, age, and gender.
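The core idea, that both the observed values and the missingness pattern inform the latent phenotype, can be sketched with a single-biomarker toy model. All parameter values below are hypothetical choices for illustration, not the study's fitted model, and the posterior is computed by direct Bayes' rule rather than the full Bayesian latent class machinery:

```python
import math

# Minimal sketch of the latent class idea: an unobserved phenotype D
# (diabetes yes/no) drives both the biomarker's value and WHETHER the
# biomarker is measured at all, so missingness itself carries information.
# All parameters are hypothetical, chosen only for illustration.

P_D = 0.3                              # prior probability of the phenotype

# P(HbA1c test observed | D): uncontrolled patients are tested more often
P_OBSERVED = {True: 0.9, False: 0.4}

# Gaussian biomarker model given D: mean HbA1c differs by phenotype
MU = {True: 8.0, False: 5.5}
SIGMA = 1.0

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_d(hba1c=None):
    """P(D = yes | data), where data is a measured HbA1c or None (missing).

    The likelihood multiplies the probability of the observation pattern
    (observed vs missing) by the density of the value when one exists.
    """
    post = {}
    for d, prior in ((True, P_D), (False, 1 - P_D)):
        if hba1c is None:
            lik = 1 - P_OBSERVED[d]    # no test: the pattern alone informs us
        else:
            lik = P_OBSERVED[d] * normal_pdf(hba1c, MU[d], SIGMA)
        post[d] = prior * lik
    return post[True] / (post[True] + post[False])

print(posterior_d(8.5))   # high HbA1c -> high posterior probability of D
print(posterior_d(None))  # a missing test by itself lowers P(D)
```

Note that a patient with no test at all still yields a posterior different from the prior: because testing is more likely for patients with the phenotype, the absence of a test is evidence against it.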
The research team then simulated EHR data for 1,000 patients to resemble a sample of patients at high risk for type 2 diabetes. They introduced two patterns of missing data in the biomarkers: missing at random and missing not at random. The team evaluated the model using the simulated data. They compared the model’s performance with existing phenotype estimation methods based on (1) biomarkers only, (2) clinical codes only, (3) biomarkers and clinical codes, (4) biomarkers with missing values replaced via MI, and (5) biomarkers and clinical codes with missing biomarker values replaced via MI. For each method, the team calculated sensitivity, specificity, and the proportion of patients misclassified relative to an actual type 2 diabetes diagnosis.
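The evaluation loop described above can be sketched as follows. This is a toy version with assumed prevalence, biomarker distributions, and missingness rates, and a naive threshold classifier standing in for the comparison methods; it shows how MNAR missingness depresses the sensitivity of a "biomarkers only" rule, and how sensitivity, specificity, and the misclassified proportion are scored against the true diagnosis:

```python
import random

# Toy version of the evaluation (not the study's model or data): simulate
# 1,000 patients, introduce MNAR missingness in the biomarker, classify with
# a naive threshold rule, and score against the true phenotype.
random.seed(1)
N = 1000

patients = []
for _ in range(N):
    diabetic = random.random() < 0.3                      # true phenotype
    hba1c = random.gauss(8.0 if diabetic else 5.5, 1.0)   # biomarker value
    # MNAR: non-diabetic (controlled) patients are far more likely untested
    missing = random.random() < (0.1 if diabetic else 0.6)
    patients.append((diabetic, None if missing else hba1c))

# Naive "biomarkers only" rule: missing or low HbA1c -> classified negative
predictions = [(h is not None and h >= 6.5) for _, h in patients]

tp = sum(p and t for (t, _), p in zip(patients, predictions))
tn = sum(not p and not t for (t, _), p in zip(patients, predictions))
fp = sum(p and not t for (t, _), p in zip(patients, predictions))
fn = sum(not p and t for (t, _), p in zip(patients, predictions))

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
misclassified = (fp + fn) / N
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"misclassified={misclassified:.3f}")
```

Untested diabetic patients are always classified negative by this rule, so every one of them counts against sensitivity; a model that also exploits the missingness pattern, as the latent class approach does, can recover some of those cases.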
Clinicians, patients, and caregivers helped design the study.
Results
The latent class model performed better than existing methods with and without MI. Sensitivity was similar (95.9% for the latent class model vs 91.9% for existing methods), and specificity was higher (99.7% vs 90.8%). Mean classification accuracy was 99%, compared with 94% for the next best method, which used biomarkers with MI.
Limitations
The latent class model may not work when the parametric assumptions about data distributions and relationships among covariates are not met.
Conclusions and Relevance
The latent class model can help characterize phenotypes when EHRs have missing data associated with patient health status.
Future Research Needs
Future research can apply the methods to phenotypes that vary over time, such as cancer stage.
Peer-Review Summary
Peer review of PCORI-funded research helps make sure the report presents complete, balanced, and useful information about the research. It also assesses how the project addressed PCORI’s Methodology Standards. During peer review, experts read a draft report of the research and provide comments about the report. These experts may include a scientist focused on the research topic, a specialist in research methods, a patient or caregiver, and a healthcare professional. These reviewers cannot have conflicts of interest with the study.
The peer reviewers point out where the draft report may need revision. For example, they may suggest ways to improve descriptions of the conduct of the study or to clarify the connection between results and conclusions. Sometimes, awardees revise their draft reports twice or more to address all of the reviewers’ comments.
Peer reviewers commented and the researchers made changes or provided responses. Those comments and responses included the following:
- The reviewers lauded the researchers' work to better predict patient diagnoses and other traits while accounting for missing information in the electronic health record. The reviewers asked that the researchers provide more information about the reasoning behind the Bayesian approaches developed in this study. The researchers responded by adding results of several comparable methods to their report to demonstrate the superiority of the Bayesian methods.
- The reviewers asked how useful this work would be for clinical research. The researchers responded that the study's results, and simulation studies like this one, have the potential to be very useful for clinical research. However, they noted that their approach must first be validated against a gold standard, which does not currently exist.