In response to the COVID-19 public health crisis in 2020, PCORI launched an initiative to enhance existing research projects so that they could offer findings related to COVID-19. The initiative funded this study and others.
Detailed clinical information about COVID-19 symptoms and disease progression can help researchers understand the illness, compare treatments, and develop public health measures. Clinical notes in electronic health records (EHRs) often contain such information. But manually extracting information from clinical notes is time-consuming and not scalable.
Natural language processing (NLP), which can use a combination of symbolic, rules-based, and statistical methods, automates the process of labeling and extracting text from clinical notes. Because COVID-19 is new, NLP-based software programs to extract COVID-19-related information are not readily available. To develop and test the accuracy of NLP methods, researchers need a reference data set of manually annotated relevant terms and phrases.
To develop and test NLP-based software to extract COVID-19-related information from EHR clinical notes
- Development of guidance for annotating EHR clinical notes with COVID-19-related information
- Development of a reference data set of 400 clinical notes annotated with COVID-19 information
- Software development
EHR data from adult patients at two sites: VUMC (N=200) and MUSC (N=200)
- F1 score for agreement between annotators creating the reference data set
- Open source software program
- F1 score for performance of the software program. This score balances recall and precision. Recall is the proportion of true positives among all true positives and false negatives in the data, that is, the share of relevant terms that the software found. Precision is the proportion of true positives among all positives identified by the software, that is, the share of identified terms that were correct.
Data Collection Timeframe: July 2020-March 2021
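The F1 score listed among the outcomes is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not code from the study; the function name `precision_recall_f1` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from counts of true positives,
    false positives, and false negatives."""
    # Precision: true positives among everything the software labeled.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives among everything that should have been labeled.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the study.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# prints: precision=0.90 recall=0.75 F1=0.82
```

Because the harmonic mean is pulled toward the lower of the two values, a program cannot reach a high F1 score by doing well on only one of precision or recall.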
First, the research team created annotation guidance that defined the COVID-19-related information to label and extract from EHR clinical notes. To develop this guidance, researchers worked with clinical providers, patients who had COVID-19, and an advisory group.
At each site, health professionals, including nurse practitioners, medical residents, and medical students, used the guidance to create a reference data set of COVID-19-related information. The health professionals labeled and extracted COVID-19-related information from one EHR clinical note per patient in a random sample of 200 adult patients at each of two sites, Vanderbilt University Medical Center (VUMC) and Medical University of South Carolina (MUSC), for a total of 400 notes. All patients had a hospital stay and tested positive for COVID-19 within 14 days before or during their stay.
Concurrently, the research team created and tested DECOVRI, an NLP-based software program, to extract COVID-19-related information from EHR notes. The software was based on an existing prototype developed at MUSC and NLP methods previously developed by the research team. The team used DECOVRI to extract COVID-19-related information from the same 400 notes for patients at the two sites. To test the accuracy of the software, the team compared results from the software against the reference data set developed by health professionals.
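Comparing software output against a manually annotated reference set can be sketched as follows. This is an illustration only, not DECOVRI's code: the `Annotation` type, the `compare_to_reference` function, the label names, and the exact-match rule (same note, same character offsets, same label) are all assumptions made for the example. The summary does not state how matches were determined.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    note_id: str
    start: int   # character offset where the labeled text begins
    end: int     # character offset where the labeled text ends
    label: str   # hypothetical label name, e.g. "symptom"


def compare_to_reference(predicted, reference):
    """Count exact matches between software output and the reference set.

    Returns (true positives, false positives, false negatives).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # labeled by both
    fp = len(predicted - reference)   # labeled by the software only
    fn = len(reference - predicted)   # labeled by the annotators only
    return tp, fp, fn


# Hypothetical annotations, not data from the study.
reference = [
    Annotation("note-1", 10, 15, "symptom"),
    Annotation("note-1", 40, 59, "symptom"),
    Annotation("note-2", 5, 16, "test_result"),
]
predicted = [
    Annotation("note-1", 10, 15, "symptom"),
    Annotation("note-2", 5, 16, "test_result"),
    Annotation("note-2", 30, 37, "symptom"),
]

tp, fp, fn = compare_to_reference(predicted, reference)
print(tp, fp, fn)  # prints: 2 1 1
```

Exact matching is the strictest choice. A more lenient rule, such as counting overlapping spans as matches, would generally produce higher scores from the same annotations.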
A patient and people with expertise in intensive medicine and clinical informatics provided input during the study.
For the reference data set, annotators had high agreement on which terms to include, with an F1 score of 0.77 for annotators at VUMC and 0.80 for annotators at MUSC.
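When F1 is used as an agreement measure between two annotators, one annotator's labels are treated as the reference and the other's as the prediction. The result is the same whichever annotator plays which role, because F1 then reduces to twice the number of shared labels divided by the total number of labels from both annotators. The sketch below illustrates this under assumptions: the function name `agreement_f1`, the representation of labels as (note, term) pairs, and the example terms are hypothetical, and the summary does not describe how the study computed agreement in detail.

```python
def agreement_f1(annotator_a, annotator_b):
    """F1 agreement between two sets of labels: 2 * |A & B| / (|A| + |B|)."""
    total = len(annotator_a) + len(annotator_b)
    if total == 0:
        # Convention chosen for this sketch: two empty sets agree fully.
        return 1.0
    return 2 * len(annotator_a & annotator_b) / total


# Hypothetical labels from two annotators on the same note.
a = {
    ("note-1", "fever"),
    ("note-1", "cough"),
    ("note-1", "shortness of breath"),
    ("note-1", "fatigue"),
}
b = {
    ("note-1", "fever"),
    ("note-1", "cough"),
    ("note-1", "fatigue"),
    ("note-1", "headache"),
}

score = agreement_f1(a, b)
print(score)                        # prints: 0.75
print(agreement_f1(b, a) == score)  # prints: True
```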
Regarding the performance of the software, the F1 score was 0.72, which indicates that DECOVRI identified most COVID-19 terms from notes and rarely mislabeled terms.
The research team used information from one EHR clinical note from each patient to create the reference data set and test the accuracy of the software. However, a single clinical note may not contain the patient’s entire history of symptoms or treatments. Results may have differed if the reference data set included more notes or more patients.
Conclusions and Relevance
Researchers could consider using DECOVRI software to extract COVID-19-related information from EHRs to better understand the disease and assess treatment efficacy.