Project Summary

One of PCORI’s goals is to improve the methods that researchers use for patient-centered outcomes research. PCORI funds methods projects like this one to better understand and advance the use of research methods that improve the strength and quality of comparative effectiveness research.

What is the project about?

To learn how well treatments work for a rare disease, researchers first need to identify patients with that disease. One way to identify patients is to search electronic health records, or EHRs. But EHRs often don’t contain consistent data on patients’ diagnoses.

To address this challenge, researchers create sets of markers for a rare disease, such as symptoms or physical traits, to search EHRs. These sets of markers are called computable phenotypes. Standard methods to create computable phenotypes take a lot of time and careful review.

In this study, the research team is developing methods for creating computable phenotypes with machine learning. Machine learning uses data to learn how to perform tasks with little or no human input.

How can this project help improve research methods?

Results may help researchers identify patients with rare diseases to take part in studies.

What is the research team doing?

The research team is developing machine learning methods that use data from a few patients who have a rare disease to create a model. Researchers can then use the model to find other patients with the disease among a larger group of patients within an EHR system. To refine the methods, the team is using EHR data from three healthcare systems.
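To make the idea of learning from a few known patients more concrete, here is a minimal sketch in Python. It is not the research team's method. It swaps in a much simpler technique, ranking unlabeled patients by cosine similarity to the average of the known cases, and every name and value in it (rank_candidates, the three yes/no markers, the patient IDs) is a made-up assumption for illustration.

```python
import math

def centroid(rows):
    """Average each marker across the known cases."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two marker vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(known_cases, unlabeled):
    """Sort unlabeled patients from most to least similar to the known cases."""
    center = centroid(known_cases)
    scored = [(pid, cosine(vec, center)) for pid, vec in unlabeled.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical yes/no markers per patient: [marker_a, marker_b, marker_c]
known_cases = [[1, 1, 1], [1, 1, 0]]           # a few patients known to have the disease
unlabeled = {"p1": [1, 1, 1], "p2": [0, 0, 1], "p3": [1, 0, 0]}

ranking = rank_candidates(known_cases, unlabeled)
for patient_id, similarity in ranking:
    print(patient_id, round(similarity, 3))
```

Patients at the top of such a ranking would be the first candidates for review. A real computable phenotype would draw on far richer EHR data and a trained model rather than a single similarity score.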

The research team is putting the methods into a software program for other researchers to use. To test the methods, the team is creating a computable phenotype for a rare disease. Then the team is comparing how well the new and standard methods work to identify patients with that disease.

Research methods at a glance

Design Element | Description
Goal | To develop machine learning methods that leverage distributed clinical data to construct accurate computable phenotypes
  • Creating multiple models and applying reinforcement learning to enable the models to train each other with minimal active learning from chart review
  • Splitting the automated feature engineering into two serial layers encoding a core, generalizable model layer common to all sites and a site-specific harmonization layer
  • Embedding the machine learning methods into open source software designed to interface with data from the PCORnet clinical research network and allow for specification of method parameters
  • Constructing a computable phenotype for primary aldosteronism
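
The two serial layers described above, a site-specific harmonization layer feeding a core model shared by all sites, can be sketched in Python as follows. This is an illustrative assumption about how such a split might look, not the project's software: the class names SiteHarmonizer and CoreModel, the single lab feature, the weights, and the choice of per-site standardization followed by a logistic score are all hypothetical.

```python
import math
from statistics import mean, pstdev

class SiteHarmonizer:
    """Site-specific layer: rescales each feature using that site's own mean and spread."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, row):
        return [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]

class CoreModel:
    """Shared layer: one set of weights applied to harmonized features at every site."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def score(self, features):
        total = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-total))  # logistic score between 0 and 1

# Hypothetical data: the same lab value recorded on different scales at two sites.
site_a_rows = [[1.0], [2.0], [3.0]]
site_b_rows = [[10.0], [20.0], [30.0]]

harmonizer_a = SiteHarmonizer().fit(site_a_rows)
harmonizer_b = SiteHarmonizer().fit(site_b_rows)
core = CoreModel(weights=[1.0])  # illustrative weight, not a fitted value

score_a = core.score(harmonizer_a.transform([3.0]))
score_b = core.score(harmonizer_b.transform([30.0]))
print(round(score_a, 3), round(score_b, 3))
```

In this sketch, a patient with the highest lab value at either site receives essentially the same score, because the site layer absorbs the difference in scale before the shared model sees the data. That is the general motivation for separating a site-specific layer from a core layer common to all sites.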

*Methods to Support Innovative Research on AI and Large Language Models Supplement
This study received supplemental funding to build on existing PCORI-funded comparative clinical effectiveness research (CER) methods studies to improve understanding of emerging innovations in large language models (LLMs).

Project Information

Daniel Herman, MD, PhD
Jinbo Chen
University of Pennsylvania Perelman School of Medicine
Development of Methods to Improve Identification of Patients with Rare or Complex Diseases

Key Dates

March 2021
June 2024


Last updated: March 14, 2024