Using CGCI Data
- ANNOUNCEMENT -
The CGCI data matrix will not function properly in Internet Explorer unless the Compatibility View is completely turned off.
The CGCI Initiative produces large-scale genomic data sets for adult and pediatric cancers. As participants in a “community resource project,” CGCI members are required to share data with the broader research community to facilitate and encourage new discoveries. Participants in the HIV+ Tumor Molecular Characterization Project (HTMCP) and the Burkitt Lymphoma Genome Sequencing Project (BLGSP) must be familiar with the CGCI Publication Guidelines.
Read the following user guide to learn how to search and download data generated by CGCI.
- About the Data
- Open vs. Controlled Access
- How to Access Protected Data
- About the CGCI Data Matrix
- How to Navigate the CGCI Data Matrix
About the Data
CGCI researchers use sequencing and, in some cases, array-based methods to find novel genetic alterations in tumors. They analyze high-quality tissue samples using one or more of the following genomic approaches:
- Whole Genome Sequencing
- Whole Exome Sequencing (2nd generation)
- Transcriptome Sequencing (2nd generation)
- Targeted Gene Sequencing (Sanger)
- Gene Expression Profiling
- Copy Number Analysis (SNP arrays)
To learn more about these approaches, visit the CGCI Research page.
Open vs. Controlled Access
CGCI employs stringent human subjects’ protection and data access policies to protect the privacy and confidentiality of the research participants. Depending on the risk of patient identification, CGCI data are available to the scientific community in two tiers: open or controlled access. Both types of data can be accessed through the CGCI Data Matrix.
Open-access data presents minimal risk of participant identification. Much of the CGCI data, excluding patient identifiers, is open access. CGCI provides the scientific community with the maximum amount of open-access data allowable under HIPAA guidelines. Access to this data does not require user certification, and researchers may explore data content without restriction.
Examples of open-access data
- Clinical information that could not be used to identify the patient
- Tissue pathology data
- Gene expression data (other than 1º exon array data or mRNA-seq)
- Tumor-specific copy number alterations and loss of heterozygosity
- Sequence data of single amplicons (matched tumor and normal when available; cannot be assembled to link to an individual)
- Tumor-associated (somatic) mutations
Controlled-access data presents a higher risk of patient identification. Although stripped of direct patient identifiers as defined by HIPAA, it contains specific demographic, clinical, and genotypic information that is excluded from open-access data. Controlled-access data is unique and valuable for research projects for which open-access data is insufficient. Access to this data requires user certification, which can be obtained through NCBI’s dbGaP (the National Center for Biotechnology Information’s database of Genotypes and Phenotypes). Researchers apply for access by filling out a Data Access Request form. Read “How to Access Protected Data” below for more information.
Examples of controlled-access data
- Specific demographic and clinical data
- Genome-wide genotypes for each case
- Information linking all sequence traces to an individual
- Whole genome, exome or transcriptome sequences for an individual case
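The two tiers can be thought of as a lookup from data category to access requirement. The following is a minimal sketch, not part of any CGCI tool: the category labels are hypothetical names paraphrased from the example lists above, and treating unlisted categories as controlled is an assumption made here for caution.

```python
from enum import Enum


class AccessTier(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# Hypothetical category labels, paraphrased from the example lists above.
TIER_BY_CATEGORY = {
    "tissue_pathology": AccessTier.OPEN,
    "tumor_copy_number_alterations": AccessTier.OPEN,
    "single_amplicon_sequence": AccessTier.OPEN,
    "somatic_mutations": AccessTier.OPEN,
    "specific_demographic_and_clinical": AccessTier.CONTROLLED,
    "genome_wide_genotypes": AccessTier.CONTROLLED,
    "whole_genome_sequence": AccessTier.CONTROLLED,
    "whole_exome_sequence": AccessTier.CONTROLLED,
    "transcriptome_sequence": AccessTier.CONTROLLED,
}


def requires_dbgap_certification(category: str) -> bool:
    """Return True when the category is listed as controlled access.

    Unknown categories default to controlled; this is an assumption of
    the sketch, not a stated CGCI rule.
    """
    tier = TIER_BY_CATEGORY.get(category, AccessTier.CONTROLLED)
    return tier is AccessTier.CONTROLLED
```

For example, `requires_dbgap_certification("somatic_mutations")` returns `False`, while `requires_dbgap_certification("genome_wide_genotypes")` returns `True`.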
How to Access Protected Data
General Outline of Instructions:
- Obtain Data Use Certification through dbGaP
- Maintain User Accounts for Data Access
- Access Data via the CGCI Data Matrix
  - Use HHS credentials (intramural investigator) or an NCI-issued user account to directly access all data stored in NCI databases
  - Use HHS credentials (intramural investigator) or an eRA Commons account (extramural investigator) to access data stored in NCBI databases
- Get Help If You Have Trouble Accessing Data
1. Obtain Data Use Certification through dbGaP
All users requesting access to controlled data must:
- Have an eRA Commons account or HHS credentials (for intramural investigators) to submit requests for access. Further information can be found on the NCBI dbGaP homepage.
- Complete the electronic dbGaP Data Access Request (SF 424 (R&R)) form, which requires a brief description of the investigator’s intended use of the data. To be approved for a Data Use Certification (DUC), requestors must:
  - Agree to restrict their use of the information to biomedical research purposes.
  - Agree not to try to identify or contact the patients.
  - Submit requests that agree with the Data Use Limitations specific to the desired data’s appropriate consent group. CGCI has datasets that fall within two separate consent groups. Investigators must get a DUC for each consent group to gain access to all CGCI datasets.
CGCI Data Use Certifications
| Consent Group | Pediatric Cancer Research | Cancer Research and General Methods |
| --- | --- | --- |
| Types of Data | Pediatric Medulloblastoma | Adult HIV-related, lymphoid (including Burkitt lymphoma), and lung cancers |
| Data Use Limitations | Access to protected pediatric data will be granted solely for those research projects that can only be conducted using pediatric data (i.e., the research objectives cannot be accomplished using data from adults) and that focus on the development of more effective treatments, diagnostic tests, or prognostic markers for childhood cancers. | Use of the data is limited to scientific research relevant to the biology, prevention, treatment, and late complications of cancers and for the development of applications proposing analytical methods, software, and other research tools. |
- Submit the completed SF 424 (R&R) form electronically to dbGaP for consideration of data access approval.
- Upon SF 424 (R&R) form submission, the signing official of the Principal Investigator’s institution will be notified of the submission and asked to certify agreement with the Data Use Limitations stated within the Data Access Request form.
- After the signing official has certified agreement, the SF 424 (R&R) application will be sent to the NCI Data Access Committee (DAC) to review for approval. The approval review process can take 2-4 weeks.
- Currently, approval in the form of an individual DUC gives the investigator access to that consent group’s data for one calendar year.
- Submit a progress report to the DAC no later than one year after obtaining the DUC. Submitting a progress report is currently a condition of data access. Approved users may also apply to renew their access to protected data when they submit the report. DAC staff will send a reminder to submit the annual progress report and, if needed, renew approval status approximately one month before the access termination deadline. If the requestor does not submit the progress report or request a renewal, access to the data will cease.
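The timing rules in this step (one DUC per consent group, one calendar year of access, a reminder about a month before the deadline) can be sketched as simple date arithmetic. This is an illustrative sketch only: the class and function names are hypothetical, "one month" is approximated as 30 days, and access is assumed to end on the anniversary date itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Consent group names as given in the CGCI Data Use Certifications table.
CONSENT_GROUPS = {
    "Pediatric Cancer Research",
    "Cancer Research and General Methods",
}


@dataclass(frozen=True)
class DataUseCertification:
    """Hypothetical record of one approved DUC."""
    consent_group: str
    approved_on: date

    def expires_on(self) -> date:
        """One calendar year after approval (Feb 29 falls back to Feb 28)."""
        next_year = self.approved_on.year + 1
        try:
            return self.approved_on.replace(year=next_year)
        except ValueError:
            return self.approved_on.replace(year=next_year, day=28)

    def reminder_on(self) -> date:
        """Approximate reminder date; 30 days stands in for 'one month'."""
        return self.expires_on() - timedelta(days=30)


def missing_consent_groups(ducs, today):
    """Consent groups with no unexpired DUC as of `today`."""
    active = {d.consent_group for d in ducs if today < d.expires_on()}
    return CONSENT_GROUPS - active
```

An investigator holding only a pediatric DUC would see `{"Cancer Research and General Methods"}` returned by `missing_consent_groups`, which reflects the requirement to obtain a DUC for each consent group.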
2. Maintain User Accounts for Data Access
Intramural investigators with an approved DUC may access protected CGCI data using their HHS credentials.
Investigators outside of HHS with an approved DUC require two separate user accounts to access protected CGCI data:
- For access to CGCI data stored and maintained at NCBI: approved users sign in with the eRA Commons account associated with the original Data Access Request. CGCI data stored at NCBI includes Sanger sequencing files and aligned reads from 2nd generation sequencing (BAM files).
- For access to data stored and maintained at the OCG Data Coordinating Center (DCC) at the National Cancer Institute (NCI): approved users outside of HHS who do not already have one will be issued an NIH External (NIHEXT) user account by NIH immediately after obtaining a DUC. This NIH-issued account is used to access data at the OCG DCC, which includes most of the genomic data generated for the CGCI initiative (clinical information, all levels of chip-based molecular characterization, and higher level sequencing data). Note: the password on this account must be updated every 120 days; instructions are distributed when the account is created.
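The account rules in this step reduce to a small decision table plus one date calculation. The sketch below is an illustration under stated assumptions: the function names and the `"NCBI"` / `"DCC"` labels are hypothetical, and the 120-day window is counted from the date the password was last changed.

```python
from datetime import date, timedelta

# Stated in the text for the NIH-issued external (NIHEXT) account.
PASSWORD_MAX_AGE_DAYS = 120


def required_account(is_intramural: bool, data_store: str) -> str:
    """Name the account an approved user needs for a given data store.

    `data_store` is a hypothetical label: "NCBI" or "DCC" (the OCG DCC).
    """
    if data_store not in ("NCBI", "DCC"):
        raise ValueError(f"unknown data store: {data_store!r}")
    if is_intramural:
        return "HHS credentials"
    if data_store == "NCBI":
        return "eRA Commons account"
    return "NIHEXT account"


def password_update_due(last_changed: date) -> date:
    """Date by which the NIHEXT password must be updated again."""
    return last_changed + timedelta(days=PASSWORD_MAX_AGE_DAYS)
```

For instance, an extramural investigator downloading BAM files from NCBI would get `"eRA Commons account"`, and a password last changed on 1 January 2024 would be due for update by 30 April 2024.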
3. Access Protected Data via the CGCI Data Matrices
Approved users may access protected CGCI data through the CGCI Data Matrices with either HHS credentials or the appropriate external account (as outlined in #2).
- Access data stored at the OCG DCC directly through the CGCI Data Matrices (requires NCI-issued account for extramural investigators):
  - Protected clinical information
  - All levels of chip-based molecular characterization data
  - Processed sequencing data (upper level files, excluding BAM files)
- Access low-level sequence files stored at NCBI indirectly through hyperlinks on the CGCI Data Matrices (requires eRA Commons account for extramural investigators):
  - Trace sequences stored in the NCBI Trace Archive: Sanger targeted sequencing
  - BAM files stored in the Sequence Read Archive, accessible through NCBI dbGaP: 2nd/3rd generation whole genome, exome, mRNA-seq, miRNA-seq
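Where a protected file lives determines whether it is downloaded directly from the matrix or reached through a hyperlink to NCBI. A minimal sketch of that routing follows; the data type keys are hypothetical labels for the items listed above, not identifiers used by the Data Matrices.

```python
# Hypothetical data type keys mapped to the storage location named in the text.
STORE_BY_DATA_TYPE = {
    "protected_clinical": "OCG DCC",
    "chip_based_characterization": "OCG DCC",
    "processed_sequencing": "OCG DCC",
    "sanger_trace": "NCBI Trace Archive",
    "bam": "NCBI Sequence Read Archive",
}


def is_direct_download(data_type: str) -> bool:
    """True when the data type is served directly from the OCG DCC."""
    try:
        store = STORE_BY_DATA_TYPE[data_type]
    except KeyError:
        raise ValueError(f"unknown data type: {data_type!r}") from None
    return store == "OCG DCC"
```

Under these assumptions, `is_direct_download("processed_sequencing")` is `True`, while `is_direct_download("bam")` is `False` because BAM files are reached through NCBI dbGaP.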
4. Get Help If You Have Trouble Accessing Data
For NCI-stored data – OCG@mail.nih.gov
For NCBI-stored data – https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=email&from=login
About the CGCI Data Matrix
CGCI data is accessible through tabular, easy-to-use Data Matrices. Each project within CGCI has its own Data Matrix. In addition to the individual project-specific data matrices, there is one composite CGCI Data Matrix that provides links to data from all CGCI projects. New data from ongoing projects gets incorporated into the Data Matrices as it becomes available, along with an update of the matrix version history. Users should note the version of the CGCI Data Matrix when accessing information.
The CGCI Data Matrix evolves over time to meet the needs of the research community. We encourage users to send comments, questions, and suggestions for improvement to email@example.com.
How to Navigate the CGCI Data Matrix
The Data Matrices link to both open and controlled access CGCI data. To obtain specific datasets or metadata, including descriptions of each project, users can hover over the text within the table and click to access the appropriate files.
- Raw or low level data files (level 1)
- Normalized and integrated data (levels 2 and 3)
- Summarized findings (level 4)
Data Access Code
- Blue = open access
- Red = controlled access (NCI & NCBI)
- Black = unavailable
Types of Data Found in the Matrix
- Names of diseases studied
- Clinical information, including outcomes
- Types of molecular data generated and platforms used
- Metadata descriptions about each individual project
- Multi-level chip-based and sequencing data links
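The color code and data levels above can be read as two small lookup tables. The sketch below shows one way to interpret a matrix entry from its link color and level; the function name and the lowercase color keys are hypothetical, and the matrix itself exposes no such interface as far as this guide states.

```python
# Color code and data levels as listed in this guide.
ACCESS_BY_COLOR = {
    "blue": "open access",
    "red": "controlled access (NCI & NCBI)",
    "black": "unavailable",
}

LEVEL_DESCRIPTIONS = {
    1: "raw or low level data files",
    2: "normalized and integrated data",
    3: "normalized and integrated data",
    4: "summarized findings",
}


def describe_entry(color: str, level: int) -> str:
    """Summarize a matrix entry from its link color and data level."""
    key = color.strip().lower()
    if key not in ACCESS_BY_COLOR:
        raise ValueError(f"unknown color: {color!r}")
    if level not in LEVEL_DESCRIPTIONS:
        raise ValueError(f"unknown data level: {level!r}")
    return f"level {level} ({LEVEL_DESCRIPTIONS[level]}), {ACCESS_BY_COLOR[key]}"
```

For example, `describe_entry("Red", 1)` yields `"level 1 (raw or low level data files), controlled access (NCI & NCBI)"`.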