1. Obtaining Data Use Certification through dbGaP
Controlled-access data are available to the research community, but users must first obtain Data Use Certification (DUC) through the National Center for Biotechnology Information’s database of Genotypes and Phenotypes (NCBI’s dbGaP). The proposed research must be consistent with the Data Use Limitations (DULs) for the requested data, i.e., general research use (most CGCI projects and HCMI) or pediatric cancer research (TARGET). Data use limitations also apply to some CGCI cases contributed by specific tissue source sites. Researchers must obtain a separate DUC for each consent group by submitting an electronic Data Access Request (DAR) through dbGaP.
- Detailed instructions to apply for controlled-access data are provided at NCI’s GDC.
- All users must have an eRA Commons account or HHS credentials (for intramural investigators) to submit a Data Access Request (DAR) form through NCBI’s dbGaP authorized access system.
*** The password on the eRA Commons account needs to be updated periodically as required by the eRA Commons. Instructions are distributed when the account is created. ***
- The following flowcharts provide an overview of the DAR process in dbGaP:
*** Failure to submit a renewal or complete the closeout process may result in termination of all current data access. ***
2. Controlled Data at NCI’s GDC and NCBI’s Sequence Read Archive (SRA)
CGCI and TARGET data can be accessed through CGCI and TARGET data matrices. HCMI data can be accessed through the GDC data portal. The various databases include different types of data. For specifics, see below and read the materials found on the specific project web pages.
The Genomic Data Commons (GDC) is a data repository that accepts and standardizes genomic, clinical and biospecimen data from cancer research programs and enables data sharing. The GDC provides a platform for efficient querying, analyzing, and downloading harmonized clinical, biospecimen, and sequence data across multiple projects.
CGCI (phs000235)
CGCI data at the GDC include:
- Raw and aligned reads from next-generation sequencing; specifics are on the Data Matrix
- Analyzed data generated by the GDC’s analysis pipeline
CGCI data at the NCBI SRA include:
- Aligned reads from next-generation sequencing (BAM files) from the Non-Hodgkin lymphoma (Diffuse Large B-Cell Lymphoma and Follicular Lymphoma) project, phs000532
TARGET (phs000218)
TARGET data at the GDC include:
- Raw and aligned reads from next-generation sequencing; specifics are on the Data Matrix
- Some aggregate data (including mutation calls and other associated molecular data)
TARGET data at the NCBI SRA include:
- FASTQ/BAM files – next-generation whole genome, whole exome, mRNA-seq, miRNA-seq, targeted capture, bisulfite-seq, ChIP-seq
HCMI (phs001486)
HCMI data at the GDC include the following; specifics will be described on the Data Matrix when available:
- Raw sequencing data
- Harmonized datasets
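The program details listed above can be collected into a small lookup table, which may help when scripting around a Data Access Request. The following is a minimal sketch only: the `Program` class, the `PROGRAMS` table, and the `lookup` helper are hypothetical names introduced for illustration and are not part of any dbGaP or GDC tool. The accession numbers, data use limitations, and access routes are copied from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """Summary of one OCG program, as described on this page (hypothetical helper)."""
    name: str
    dbgap_accession: str       # dbGaP study accession listed on this page
    data_use_limitation: str   # DUL the proposed research must be consistent with
    gdc_entry_point: str       # where this page says the GDC data can be reached


# Values are taken from the text above; nothing here is queried from dbGaP or the GDC.
PROGRAMS = {
    "CGCI": Program("CGCI", "phs000235",
                    "general research use (most projects)", "CGCI Data Matrix"),
    "TARGET": Program("TARGET", "phs000218",
                      "pediatric cancer research", "TARGET Data Matrix"),
    "HCMI": Program("HCMI", "phs001486",
                    "general research use", "GDC data portal"),
}


def lookup(program_name: str) -> Program:
    """Return the summary for a program name, ignoring case and surrounding spaces."""
    key = program_name.strip().upper()
    if key not in PROGRAMS:
        raise KeyError(f"Unknown OCG program: {program_name!r}")
    return PROGRAMS[key]


if __name__ == "__main__":
    for program in PROGRAMS.values():
        print(f"{program.name}: {program.dbgap_accession}, "
              f"{program.data_use_limitation}, via {program.gdc_entry_point}")
```

Because some CGCI cases carry their own data use limitations, a table like this is only a starting point; the consent groups shown in dbGaP remain the authoritative source.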
3. Controlled-access Data at OCG’s Data Coordinating Center (DCC)
The Data Coordinating Center (DCC) is responsible for managing the flow of data generated by the OCG programs. The DCC houses raw, processed, and analyzed data produced by OCG project teams for the projects’ manuscripts. The OCG program data at the DCC differ from the data at the GDC: the GDC downloads all the raw CGCI and TARGET next-generation sequencing data and performs its own analysis and harmonization through the GDC analysis pipeline, producing its own L3/analyzed data files.
Data users outside of HHS require eRA Commons account credentials to log on to Globus.org and access controlled data housed at OCG’s Data Coordinating Center (DCC). Upon approval, dbGaP-approved PIs and designated downloaders will receive an email with detailed instructions on how to use Globus.org to access controlled data at the OCG DCC.
CGCI
CGCI data stored at the OCG’s DCC include:
- Protected clinical information
- Raw chip-based molecular characterization data
- Project team-processed sequencing data (upper-level files, e.g., VCF or MAF)
- Epstein-Barr virus BAM files from pediatric Burkitt lymphoma cases
TARGET
TARGET data stored at the OCG’s DCC include:
- Protected clinical information
- Project team-processed sequencing data (upper-level files, e.g., VCF or MAF)
4. Where to Get Help If You Have Trouble Accessing Data