Project Summary

The ability to accurately identify specific groups of patients is a key ingredient in applying modern information technology to improve the practice of medicine. In a health system, if providers can identify patients with a specific disease, they can implement programs to systematically offer appropriate care to those patients. In a clinical research network, if we can identify specific patient cohorts at scale, we can more efficiently trial new treatments. To locate these patients, investigators formulate a set of criteria, known as a computable phenotype, that can be used to search for patients in clinical data.

Unfortunately, because health records are not designed for this purpose, constructing a computable phenotype that finds most of the target patients while flagging few patients outside the target group requires considerable time, expertise, and troubleshooting. In this project, we propose to develop innovative machine learning methods, and incorporate them into software, that will largely automate the construction of such computable phenotypes. These methods will learn from a small set of target patients and will not require a definitive set of nontarget patients, which should enable them to perform particularly well for diseases in which many affected patients are unaware that they have the disease. The methods will also learn from many clinical practices’ data without the need to share individuals’ data, which should enable them to learn how to find patients with rare conditions. Moreover, the methods will be configurable to ensure that the resulting computable phenotypes are equitable across diverse patients.

Finally, these methods will train pairs of complementary models that teach each other, which will enable the methods to adapt to new practice sites with only a small amount of external input from clinical experts. We will test the methods by building a computable phenotype for primary aldosteronism, an underdiagnosed, undertreated cause of high blood pressure. We will then compare the fraction of patients flagged by this computable phenotype who are truly affected with the corresponding fraction for conventionally trained computable phenotypes. We hope the tools developed in this project will become widely used in identifying a variety of target patient groups across many clinical practices, thereby catalyzing efforts to systematically trial and implement new approaches for precision medicine.
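
The comparison described above, the fraction of flagged patients who are truly affected, is commonly called the positive predictive value. As a minimal sketch of that calculation, and not the project's actual evaluation code, the Python below computes it for two phenotypes. The function name and all patient labels are hypothetical and chosen only for illustration.

```python
def positive_predictive_value(flagged, truly_affected):
    """Return the fraction of flagged patients who are truly affected."""
    true_positives = sum(1 for f, t in zip(flagged, truly_affected) if f and t)
    total_flagged = sum(1 for f in flagged if f)
    if total_flagged == 0:
        raise ValueError("no patients were flagged, so the fraction is undefined")
    return true_positives / total_flagged


# Hypothetical results for eight patients (illustrative only, not study data).
truly_affected = [True, True, False, False, True, False, False, True]
new_phenotype = [True, True, False, False, True, False, True, False]
conventional = [True, False, True, True, True, False, True, False]

ppv_new = positive_predictive_value(new_phenotype, truly_affected)
ppv_conventional = positive_predictive_value(conventional, truly_affected)
print(f"new: {ppv_new:.2f}, conventional: {ppv_conventional:.2f}")
```

With these made-up labels, the first phenotype flags four patients, three of whom are affected, and the second flags five, two of whom are affected. A higher fraction means fewer unaffected patients are flagged by mistake.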

Project Information

Daniel Herman, MD, PhD
Jinbo Chen
University of Pennsylvania Perelman School of Medicine

Key Dates

March 2021
June 2024


Last updated: March 15, 2022