De-clunking the dbGaP Data Submission and Access Process – We’re All Ears!

Data. It is the essential output of biomedical research that allows us to move science forward and improve human health. It gets a little trickier, however, when the conversation turns to how best to provide researchers with access to that data, especially when you’re trying to balance appropriate protections for human participants in research, who deserve both the maximal use of their data for achieving medical progress and the respectful use of their data in a way that affords privacy protections and consistency with consent. At the NIH, lots of smart people spend a lot of time thinking about human data and how best to manage it. However, we can’t do it alone. We also need help from our stakeholders to solve these difficult issues.

Back in 2007, the National Center for Biotechnology Information (NCBI) developed the database of Genotypes and Phenotypes (dbGaP) to archive and distribute the results of human genotype-phenotype studies that fall under NIH’s policies for sharing genomic data.

dbGaP is a controlled-access data repository and currently serves as a central portal to submit, locate, and request access to genomic and associated phenotypic data. It is a highly utilized, valuable, and rapidly growing resource with over 750 studies available for access. Users of dbGaP have access to a wide range of data types such as microarray, genome-wide association study, whole and targeted genomic, transcriptomic, epigenomic, and metagenomic data. As of January 2017, NIH has approved approximately 28,000 Data Access Requests (DARs) for over 4,500 investigators from 46 countries.

Over the years, users of dbGaP have shared their feedback, and many have expressed frustration with the difficulty of navigating the submission process. To address these concerns, NIH has made a number of improvements to dbGaP (see Box 1). To best serve the needs of the research community and enable robust and responsible data sharing, it is imperative that new resources, tools, and data management models be developed to make the system as user-friendly and efficient as possible, as well as to increase its utility.

With this in mind, NIH released today a Request for Information (RFI) seeking public comments on the data submission and access processes for dbGaP, and on the management of data within dbGaP, in order to consider options to improve and streamline these processes.

It is vital that we hear from members of the research community on this topic. We want to take your thoughts and ideas into account when attempting to increase the utility of dbGaP. I invite all stakeholders who currently use or may use dbGaP to provide us with their thoughts. Comments will be accepted until April 7, 2017.

                                                   

Box 1: Recent Improvements/Upgrades to dbGaP

  • Development of standard data use limitations to promote consistent implementation of the consent group categories.
  • Development of fillable Institutional Certification forms to standardize and expedite the Institutional Certification process for institutions.
  • Implementation of user-friendly, electronic study registration, submission, DAR, project renewal, and project close-out forms.
  • Development of the dbGaP Data Browser to enable viewing of controlled-access summary statistics and individual-level genotype and sequence data associated with phenotypic features, by dbGaP approved users, without the need to download datasets.
  • In collaboration with the Global Alliance for Genomics and Health Beacon project, implementation of a simple web interface that allows users to query dbGaP for genomic variants of interest and their presence in the database.
  • Issuance of a Position on the Use of Cloud Computing Services for Storage and Analysis of Controlled-Access Data Subject to the NIH Genomic Data Sharing Policy to allow investigators to request permission to transfer controlled-access genomic data and other associated data obtained from dbGaP to public or private cloud systems for storage and analysis.
  • Creation of search filters for dbGaP datasets (e.g., data use limitations, disease area, data type).
  • Assembly of two data collections that allow investigators to submit a single DAR to gain access to most of the individual-level datasets in dbGaP approved for general research use (currently includes 96 datasets), or only the aggregated data from these datasets.
  • In an effort to promote transparency, the addition of a “Facts & Figures” section on the NIH GDS website to highlight current dbGaP data submission and access statistics, including DAR processing times and data management incidents.
  • Development of a mechanism to establish structured partnerships with external organizations or “trusted partners”.