Project Summary

This research project is in progress. PCORI will post the research findings on this page within 90 days after the results are final.

One of PCORI’s goals is to improve the methods that researchers use for patient-centered outcomes research. PCORI funds methods projects like this one to better understand and advance the use of research methods that improve the strength and quality of comparative effectiveness research.

What is the project about?

Many factors in patients’ lives can affect their health. For example, social factors like education or income and behavioral factors like smoking affect patients’ cancer risk. Electronic health records, or EHRs, include useful information about these factors, often in doctors’ notes. But researchers can’t easily study information from these notes because notes aren’t written in a standard way.

In this study, the research team is creating methods to identify social, behavioral, and clinical factors in doctors’ notes for use in research. The new methods use natural language processing, or NLP. In NLP, computer programs interpret written language and make it easier to sort and study.

How can this project help improve research methods?

Results may help researchers identify social, behavioral, and clinical factors in doctors’ notes.

What is the research team doing?

First, the research team is developing NLP methods that better account for the shortened phrases, or abbreviations, that clinicians may use in place of longer medical terms.

Second, the research team is developing methods to combine medical knowledge with statistical NLP methods. With these methods, NLP systems can use statistical techniques to manage phrases that occur often in the text and draw on medical knowledge for phrases that don’t occur often.

Third, the research team is creating a new NLP software package called SODA. SODA gives researchers a way to use the new methods to identify social, behavioral, and clinical factors in EHR notes. It also organizes and stores the information so researchers can study these factors. The research team is linking SODA with existing NLP systems. At sites in Florida and New York City, the team is testing how well SODA sorts cancer risks and patient traits in lung cancer screening.
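
As a rough illustration of the first two ideas above, the sketch below expands shortened phrases and then labels a phrase from counts when it has been seen often, falling back to a small knowledge table when it has not. This is not SODA or the research team's code. The abbreviation table, the knowledge table, the category names, the threshold, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical abbreviation table (illustrative only, not from the project).
ABBREVIATIONS = {"hx": "history", "pt": "patient", "ppd": "packs per day"}

# Hypothetical knowledge source mapping a phrase to a factor category.
KNOWLEDGE = {
    "packs per day": "behavioral:smoking",
    "unemployed": "social:employment",
}

# Assumed threshold: phrases seen at least this often are handled by counts.
MIN_COUNT = 2


def expand(note):
    """Lowercase a note and replace known abbreviations with full terms."""
    tokens = note.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def train(labeled_phrases):
    """Count how often each phrase was labeled with each category."""
    counts = {}
    for phrase, label in labeled_phrases:
        counts.setdefault(phrase, Counter())[label] += 1
    return counts


def classify(phrase, counts):
    """Use counts for frequent phrases and the knowledge table for rare ones."""
    seen = counts.get(phrase)
    if seen and sum(seen.values()) >= MIN_COUNT:
        return seen.most_common(1)[0][0], "statistical"
    if phrase in KNOWLEDGE:
        return KNOWLEDGE[phrase], "knowledge"
    return None, "unknown"
```

For example, `expand("Pt smokes 1 ppd.")` turns the shortened note into full terms before any labeling happens. A real system would use trained statistical models and a curated medical vocabulary instead of these small tables.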

Research methods at a glance

Design Elements | Description
  1. Develop ontologies, corpora, and NLP methods to extract social and behavioral determinants of health and adverse events with improved handling of abbreviations related to medical concepts
  2. Develop methods to integrate medical knowledge with statistical NLP methods
  3. Develop and disseminate an NLP software package called SODA, which extracts, standardizes, and populates social and behavioral determinants of health, adverse events, and clinical factors from doctors’ notes
Approach | NLP

Project Information

Yonghui Wu, PhD
University of Florida
Natural Language Processing to Connect Social Determinants and Clinical Factors for Outcomes Research

Key Dates

August 2019
October 2024

Last updated: March 14, 2024