Whole Genome Sequencing

Data Generation Protocols Data Analysis Protocols
CaCxIcon for HTMCP- Cervical Cancer Project CaCxIcon for HTMCP- Cervical Cancer Project

Data Generation Protocols

The data generation protocols for the HTMCP- Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Whole genome sequencing library construction

One microgram of DNA was arrayed in each well of a 96-well plate and sheared by sonication (Covaris). Sheared DNA was end-repaired and size selected using AMPure XP beads targeting a 300-400bp fraction. After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using AMPure XP beads. Library fragment sizes were assessed using an aliquot of PCR amplified library DNA on an Agilent 2100 Bioanalyzer DNA1000 chip, or Caliper GX DNA1000 chip.

Genome library construction for custom capture

DNA (500 ng) was sonicated (Covaris) to 250-350 bp, purified using PCRClean DX magnetic beads (Aline Biosciences), end-repaired, phosphorylated and bead purified before A-tailing using a custom NEB Paired-End Sample Prep Premix Kit. Illumina sequencing adapters were ligated overnight at 16oC, bead purified and enriched with 6 cycles of PCR using indexed primers enabling library pooling and sequenced using paired-end 125 base reads in a single flowcell lane.

Whole genome sequencing

Tumor genomes were sequenced to a target depth of 80X coverage, and normal blood samples to 40X coverage using 125bp reads. Sequencing was performed on an Illumina HiSeq2500.

Significantly mutated genes (SMGs)

MutSig2CV (https://software.broadinstitute.org/cancer/cga/mutsig)was used to identify significantly mutated genes (SMGs) using default parameters and a genome-wide coverage file. Genes meeting a threshold of q<0.1 and passing manual review in IGV for 3 samples with a mutation were considered to be SMGs.

Experimental protocols

To request more information or approval regarding the following protocol, please contact BC Cancer at labqa@bcgsc.ca

96-well PCR Free Library Construction on NIMBUS for Illumina Sequencing

Data Analysis Protocols

The data analysis protocols for the HTMCP- Cervical Cancer project were acquired from the following manuscript.

Gagliardi A, Porter VL, Zong Z, et al. Analysis of Ugandan cervical carcinomas identifies human papillomavirus clade-specific epigenome and transcriptome landscapes. Nat Genet. 2020;52(8):800-810. (PMID: 32747824)

Estimation of tumor content  

Tumor purity and ploidy were estimated using Ploidetect (L.C., in preparation). Tumor reads and heterozygous SNP allele frequencies in non-overlapping bins (~100 kb and equal coverage in the matched normal samples) were computed for each case. Read counts were modelled using Gaussian mixture models (GMM), modified to restrict component means as a fixed depth apart and component variances to be equal to one another. Allele frequencies were modelled using a separate GMM, incorporating priors from the first. Models were generated for each possible value of tumor purity and scored based on the mean likelihood of both the depth and the allele frequency GMMs. All results were verified by review of GMM parameters and their fit to the data. Estimates were congruent with observed copy number data in 104 out of 118 samples. In the remaining cases, purity and ploidy were determined by review of alternate models.

Somatic alteration detection

Tumor and normal sequencing reads were aligned to the human reference genome (hg19) using BWA-MEM v0.7.). Read duplicates were marked using sambamba(v0.5.5). Somatic single nucleotide variants (SNVs) were identified using Strelka (v1.0.6). A panel of 2,735 genes including mutated oncogenes, tumor suppressors, epigenetic modifiers, splicing factors, or other genes recurrently mutated (≥3 cases) in this cohort, were selected for targeting sequencing in the extended cohort. The coding mutation rate was reported for each tumor as the number of coding SNVs (low, moderate or high SNPeff annotation) per Mb.

Mutation signatures and HRD score

SNVs were categorized into 96 mutation classes based on 6 variant types and 16 trinucleotide contexts. For each sample, values of the 96 classes were used to compute a non-negative least squares deconvolution based on 30 previously described mutational signatures (COSMIC). The APOBEC signature reported for each sample is the max exposure value of signature 2 or 13.

HRD scores were computed using the R package HRDtools (v0.0.0.9), as the arithmetic sum of loss of heterozygosity (LOH), TAI, and LST scores, determined based on published guidelines.

Copy ratio landscape comparisons

Copy number alterations (CNAs) between cohorts were called and analysed using (v4.0.9, https://gatkforums.broadinstitute.org/gatk/discussion/9143/how-to-call-somatic-copy-number-variants-using-gatk4-cnv) and GISTIC2.0 (v2.0.17). Genomic intervals were prepared by dividing the reference genome into equally sized bins (1000bp). A panel of normals was generated to median sample reference counts. Allele counts were collected independently for the tumor and matching normal. Continuous segments were modelled with both the allelic ratios and copy ratios.

Germline CNAs previously identified in the TCGA cervical cancer (CESC) study were filtered out to remove any potential germline CNVs in this cohort. Segments were excluded if 75% or more of the segment overlapped with these.

Somatic copy number alterations in TCGA CESC tumors were determined previously using SNP 6.0 arrays11 and these were downloaded from the Broad GDAC website. The 178 samples in TCGA CESC core set were used for comparison to our HIV- samples.

To determine regions of CNA variance between cohorts (HIV- vs. HIV+, HIV- vs. TCGA), analyses were performed on each cohort separately according to the GISTIC2 (v2.0.22) documentation (http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/GISTIC_2.0) with the parameters -qvt 0.25, -genegistic 1, -broad 1, -brlen 0.5, -conf 0.99, -armpeel 1, -savegene 1, -gcm extreme, -maxseg 3000. The genome was binned into 1kb segments and the fraction of patients having a copy gain (>0.1) or loss (<-0.1) was calculated based on the mean segment values for each cohort. Significantly amplified and deleted chromosome arms were identified using an FDR threshold < 0.25. Unique or shared arms and cytobands were identified as those significant in one cohort, and not the other.

Specific copy number alterations in samples

Regions of CNA in individual samples were identified using the Hidden Markov model-based approach CNAseq (v0.0.6). Amplifications were defined as those with predicted total copies greater than twice the estimated tumor ploidy, and homozygous deletions were defined as those predicted to have loss of all copies.

Non-coding mutation hotspots

Non-coding variants annotated by SNPeff as 5' Flank, 3' Flank, IGR, 3' UTR, Intron, 5'UTR, RNA, Splice_Site, Translation_Start_Site were used as input to Rainstorm , with all parameters set at default values (k=4).

Of the 3,094 hotspot regions identified, we focused on 3,539 potential point mutation hotspots, present in 3 or more samples.  These were filtered for those called by both Strelka and MutationSeq and did not reside in centromeric regions. Further filtering removed any variant called in a normal sample, reducing the potential non-coding hotspots to 404, of which 7 (high confidence) were confirmed by manual review.

Hotspots were annotated as ‘potential promoter’, ‘potential enhancer’, or ‘intergenic’ using ChIP-seq data (enhancer= intersect of H3K4me1 and H3K27ac peaks, promoter= H3K4me3).

We assessed the potential for other non-coding hotspots to alter transcription factor binding using motifBreakR26.

DNA methylation analysis

Human DNA methylation using the EPIC array (Illumina) was performed by The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Canada. The DNA methylation beta-values for 115 samples were binarized as unmethylated (beta<=0.25) and methylated (beta>0.25). The 8,000 most variable probes were clustered using the ConsensusClusterPlus (v1.38.0, R) with a ‘binary’ distance and ‘ward.D2’ clustering method with 1,000 iterations.

Differentially methylated probes (DMP) and differentially methylated regions (DMR) between clade A7- and A9-infected samples were determined using CHAMP (v2.10.2, R) (q<0.05); DMRs used the ‘bumphunter’ method. For the DMPs, associated gene, genomic feature, and CpG island features of these probes came from CHAMP. DMRs were intersected with protein-coding genes (hg19 Ensembl (v75), n=20,232) using bedtools (v2.27.1).

Human and viral gene expression and gene ontology enrichment analyses

Clustering analysis was performed with ConsenusClusterPlus (v1.38.0, R) using log10(RPKM) values using the ‘Pearson’ method and ‘ward.D2’ linkage with 1000 iterations. Human genes used included the top 1,000 most variable genes (RPKM > 5 in at least one patient). All 118 samples were included in human gene clustering and 117 in viral gene clustering (no gff file was available for HPV51).

Differential gene expression between groups (A7 vs. A9; E6 and E7 high vs. low) was performed using the DESeq2 (v1.14.1, R). Genes were filtered using an adjusted p-value <0.05, >1.5-fold change in mean expression, and a baseMean expression >1000. For the A7 vs. A9 comparison, the differential analysis was normalized for histology using a multifactorial approach. Results from the normalized analysis were compared to those using only squamous A7 and A9 samples to ensure histology correction was only removing expression differences attributed to histologies (89% concordance).

Functional enrichment of the significantly differentially expressed genes in the A7 vs. A9 comparison was performed using STRING (v11.0). For visualisation, enrichment scores for A7-enriched ontologies were set to negative values. Functional enrichment of the significantly differentially expressed upregulated and downregulated gene lists for the E6 and E7 analysis was performed using HOMER (v4.10.3).

ERV quantification

5,467,457 repeat elements and hg19 coordinates (chromosomes 1-22, X) were downloaded from RepeatMasker Open v.4.0.5 (http://www.repeatmasker.org/faq.htm). To minimize read count bias from nearby expressed protein-coding genes, we filtered for ERVs >10 kb away from their nearest gene. Raw expression values were calculated by counting the number of reads that mapped unambiguously (mates mapped within 10 kb) to each region and were normalized for sequencing depth and length by conversion to RPKMs.

Last updated: August 07, 2020