Christophe Lambert, University of New Mexico Health Sciences Center
Stakeholder
Health Researcher
The draft policy does not address commercial and private administrative claims and electronic health record (EHR) patient databases, which are used in observational research on the comparative safety and effectiveness of treatments (including PCORI-funded CER). The license restrictions of most administrative claims and EHR data providers prevent dissemination of individual-level patient data, even though these data sources are deidentified. This is partly to protect patients from determined attempts at reidentification by those who have not signed legal agreements promising not to attempt it, and partly to protect commercial interests in proprietary data that were collected at great expense.
The policy should state that qualified researchers may acquire access to such data sources through their institutions by paying the appropriate license fees. This allows findings to be reproduced, particularly if researchers make available all of their scripts for extracting cohorts and variables from these datasets, along with their scripts for performing the analysis (e.g. R code, SAS code, etc.).
We propose that, for EHR and administrative claims research, researchers be required to make available the code used to perform the study, so that someone with access to the source data from the data provider can reproduce it, but that they not be required to post the individual-level data.
In addition, it is reasonable for the researcher to archive the individual-level data and analysis scripts required to reproduce the study. However, because researchers do not own the commercial/private data, they are not in a position to bargain over the license terms or over how they may share the commercial/private individual-level data. If all individual-level data were required to be open, this would kill large-scale observational research on EHR and administrative claims data. Only institutions that own their data could potentially participate in research that required disclosure of individual patient data, which would limit sample size and generalizability. Even then, high-quality research requires very specific patient data to be characterized and included in models (for example, the information in doctors’ notes). If very fine-grained covariates were used in the models and had to be openly released, model quality would come into conflict with patient privacy.
We are developing methods (not funded by PCORI) to enable release of cluster-level data that would both protect individual identity and enable reproducibility of research, but these technologies are still too new to serve as a basis for policy.
I would draw attention to the Observational Health Data Sciences and Informatics (OHDSI, http://www.ohdsi.org) effort to create a framework for representing a study in both human-readable and machine-readable form. The source code for a study can then be used not only to reproduce the study on the original source data, but also to replicate it on additional data sources that have been mapped to the OMOP common data model. Currently, over 630 million patients’ records worldwide exist in various repositories in the OMOP common data model, and multi-country studies are beginning to be run over a distributed research network. For example, see: Hripcsak G, et al. (2016) Characterizing treatment pathways at scale using the OHDSI network, Proceedings of the National Academy of Sciences, doi:10.1073/pnas.1510502113. In that article, 250 million patients’ data were analyzed across global databases from the USA, Japan, South Korea, and elsewhere. Each organization runs the same analysis protocol and returns summary results, with no need to distribute individual-level patient data.
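As a rough illustration of the distributed pattern described above, the following toy sketch shows the same analysis function running at each site, with only aggregate counts leaving the site. It is not OHDSI software: the function names, record fields, and patient records are all invented for the example, and a real network would add safeguards such as suppressing small counts.

```python
from collections import Counter


def run_protocol(local_records):
    """Shared analysis protocol (hypothetical): count patients by first treatment.

    Runs inside each site. Only the aggregate Counter is returned;
    the individual-level records never leave the site.
    """
    return Counter(record["first_treatment"] for record in local_records)


def pool(site_summaries):
    """Coordinator step: combine the per-site summary counts."""
    total = Counter()
    for summary in site_summaries:
        total.update(summary)
    return total


# Invented toy records standing in for two sites' private data.
site_a = [
    {"patient_id": 1, "first_treatment": "metformin"},
    {"patient_id": 2, "first_treatment": "metformin"},
    {"patient_id": 3, "first_treatment": "insulin"},
]
site_b = [
    {"patient_id": 10, "first_treatment": "metformin"},
    {"patient_id": 11, "first_treatment": "glipizide"},
]

# Each site runs the identical protocol locally and shares only its summary.
summaries = [run_protocol(site_a), run_protocol(site_b)]
pooled = pool(summaries)
print(dict(pooled))
```

Because the protocol is ordinary source code, anyone licensed to use one of the underlying databases could rerun it and compare their summary with the published one.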
We are on the cusp of enabling global, reproducible comparative effectiveness research on the data of more than half a billion patients, without the need to disclose individual-level patient data. Requiring that hundreds of millions of patients' individual-level data be deidentified and posted somewhere would be unreasonable given current concerns about patient privacy and the interests of private data holders.