Project Summary

When research is done using clinical data from large networks, it is important that the data capture the meaning that the research study needs, a characteristic called semantic data quality. If the semantic data quality is poor, the study may be slowed down, draw the wrong conclusions (for example, mistakenly suggest that a drug is not effective), or not be applicable outside the limited circumstances of the study. However, current methods for testing data quality are limited and abstract. Additionally, they deal primarily with technical requirements like whether the data are missing, and whether values fit into a data model; there is not much existing guidance about how to test clinical meaning.
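As a rough illustration of the difference between a technical check and a check of clinical meaning, the following is a minimal sketch in Python. The records, field names, and the rule itself (a patient recorded with type 1 diabetes would be expected to have insulin recorded) are hypothetical assumptions chosen for illustration; they are not drawn from the project's guidelines or tools.

```python
# Hypothetical toy records; the field names and rules are illustrative
# assumptions, not part of the project's guidelines or software.
PATIENTS = [
    {"id": 1, "dx": "type 1 diabetes", "drugs": ["insulin"]},
    {"id": 2, "dx": "type 1 diabetes", "drugs": []},
    {"id": 3, "dx": None, "drugs": ["amoxicillin"]},
]


def technical_failures(patients):
    """Technical check: flag records whose diagnosis field is missing."""
    return [p["id"] for p in patients if p["dx"] is None]


def semantic_failures(patients):
    """Semantic check: flag type 1 diabetes records with no insulin recorded."""
    return [
        p["id"]
        for p in patients
        if p["dx"] == "type 1 diabetes" and "insulin" not in p["drugs"]
    ]


print(technical_failures(PATIENTS))  # [3]
print(semantic_failures(PATIENTS))   # [2]
```

In this sketch, record 2 passes the technical check because every field is filled in, yet it fails the semantic check because the recorded data do not fit the expected clinical picture.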

The project team proposes to create a set of guidelines that outline how to effectively test semantic data quality. The team will start with a framework for data quality assessment (DQA) that integrates clinical context with data quality methods, and will work with a committee of 15 stakeholders, including researchers, informaticians, patients, and policy makers, to develop guidelines and test cases. The team will share the consensus recommendations from this committee with 60 additional stakeholders to get their feedback on what kinds of testing are most practical and useful to them. The team will take this feedback into account and create a final set of guidelines.

Once the guidelines are settled, the team will write a set of computer programs that can be easily tailored to execute semantic DQA across a wide variety of use cases. The team hopes that the combination of guidelines and accessible tools will make it easier for people to start doing more comprehensive DQA.

To see how well these tools work, the team will pick five test cases, built on real or hypothetical research questions, and use the guidelines to build DQA tests. The team will share these tests with three PCORnet data networks that have agreed to run them, and will compare the results to see whether the tests detect differences between networks. Because it’s important that DQA results be understandable not just by the researcher, but by everyone using the results, the team will also ask the group of 60 stakeholders to rate the way the data are visualized.

The team will make all guidelines and tools freely available through software sharing sites such as GitHub, and will let people know about them by publishing papers, presenting at meetings, and posting them on collaboration websites. The team will also make testing results available so other people can compare them with their own test results. The team hopes that others will be able to use these resources to create better and more accessible evidence from their studies.

Project Information

Charles Bailey, MD, PhD
Children's Hospital of Philadelphia

Key Dates

July 2021
June 2026


Last updated: January 20, 2023