Whole Genome Sequencing

Data Generation Protocols Data Analysis Protocols
Data Generation Protocols Data Analysis Protocols

Data Generation Protocols

The data generation protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Whole genome sequencing of fresh frozen samples

Whole genome sequencing (WGS) libraries were constructed from DNA provided by Nationwide Children’s Hospital (Columbus, OH) using a polymerase chain reaction (PCR)-free protocol. To minimize library bias and coverage gaps associated with PCR amplification of high GC or AT-rich regions we have implemented a version of the TruSeq DNA PCR-free kit (E6875-6877B-GSC, New England Biolabs), automated on a Microlab NIMBUS liquid handling robot (Hamilton). Briefly, 500 ng of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris LE220). Sheared DNA was end-repaired and size selected using paramagnetic PCRClean DX beads (C-1003-450, Aline Biosciences) targeting a 300-400 bp fraction. After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using paramagnetic (Aline Biosciences) beads. PCR-free genome library concentrations were quantified using a qPCR Library Quantification kit (KAPA, KK4824) prior to sequencing with paired-end 150 base reads on the Illumina HiSeqX platform using V4 chemistry according to manufacturer recommendations.

Whole genome sequencing of formalin-fixed, paraffin-embedded samples 

A 96-well library construction protocol was performed from formalin-fixed and paraffin-embedded (FFPE) tissue extracted genomic DNA provided by Nationwide Children’s Hospital (Columbus, OH). Since DNA extracted from FFPE tissue will be damaged by the fixation process and prolonged storage in non-ideal conditions, variable DNA quality across the collection is expected with some highly degraded samples. DNA was normalized to 500 ng in a volume of 62 μL elution buffer (Qiagen) and transferred into a microTUBE plate for shearing on an LE220 (Covaris) acoustic sonicator using the conditions: Duty Factor, 20%; Peak Incident Power, 450W; Cycle per burst, 200; Duration, 2 x 60 seconds with an intervening spin. The profile of sheared FFPE DNA extracted by the Qiagen Allprep DNA/RNA FFPE protocol has a dominant DNA peak in the size range between 300 and 400bp. To improve library quality of FFPE-derived DNA, solid phase reversible immobilization (SPRI) bead-based size selection was performed before library construction to remove smaller DNA fragments from highly degraded FFPE DNAs. If not removed early in the library construction process, these smaller fragments would otherwise dominate the final amplified library. FFPE DNA damage and end-repair and phosphorylation were combined in a single reaction using an enzymatic premix (NEB), then bead purified using a 0.8:1 (bead: sample) ratio to remove small FFPE fragments. Repaired DNA fragments were next A-tailed for ligation to paired-end, partial Illumina sequencing adapters then purified twice with SPRI beads (1:1 ratio). Full-length adaptered products were achieved by performing 8 cycles PCR with primers introducing fault-tolerant hexamer “barcodes” allowing multiplexing of libraries. Indexed PCR products were double purified with 1:1 beads. Concentration of final libraries was determined using size profiles obtained from a high sensitivity Caliper LabChip GX together with Quant-iT (Invitrogen) quantification.

Experimental Protocols

To request more information or approval regarding the following protocol, please contact BC Cancer at labqa@bcgsc.ca.

96-well PCR Free Library Construction on NIMBUS for Illumina Sequencing


Data Analysis Protocols

The data analysis protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Simple somatic mutations 

Somatic single nucleotide variants (SNVs) and small insertions/deletions (indels), also known as simple somatic mutations (SSMs), were called from paired tumor-normal WGS data using the Strelka workflow (version 1.0.14). The default Strelka configuration for data aligned with bwa (strelka_config_bwa_default.ini) was used with the exception of filtering SNVs with a minimum QSS of 25 (default 15). For SNVs and indels, reference and alternate allele counts were taken from the Strelka output variant call format (VCF) file. SNVs and indels were annotated using vcf2maf (version 1.6.12) and Ensembl Variant Effect Predictor (release 86). Transcript selection for annotation was performed by vcf2maf with the following exception. Noncanonical transcripts were instead selected if they were non-synonymously mutated more commonly than the canonical transcript (minimum increase of two affected cases). SNVs and indels were further filtered for a minimum alternate allele count of six and a minimum variant allele fraction (VAF) of 10% and 20% for fresh frozen (FF) and formalin-fixed, paraffin-embedded (FFPE) tumors, respectively. Tumors with a median VAF below 25% were omitted from subsequent analyses due to either excessive noise or low predicted tumor content. The same pipeline was used for detecting SNVs and indels in the targeted validation sequencing data, with the exception that depth filters were disabled for Strelka (isSkipDepthFilters = 1).

Significantly mutated genes 

Considering only SNVs and indels, significantly mutated genes were identified using an ensemble approach integrating four methods: MutSigCV, OncodriveFM, OncodriveFML, and OncodriveCLUST. Mutations were lifted over from GRCh38 to GRCh37 using CrossMap (version 0.2.5) along with the “hg38ToHg19” chain file provided by the UCSC Genome Browser. Lifting over variants was necessary because some of the methods listed above rely on GRCh37 reference data. For consistency, the lifted-over mutations based on GRCh37 served as input for all methods. Non-synonymous mutations were defined as those with one of the following values in the MAF file Variant_Classification field, as annotated by vcf2maf: Splice_Site, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, Nonstop_Mutation, Translation_Start_Site, In_Frame_Ins, In_Frame_Del, or Missense_Mutation. To minimize noise, we only considered genes deemed significant (Q-value < 0.1) by two or more methods.  

BL-associated genes 

We defined BL-associated genes (BLGs) as any gene deemed significantly mutated in this study or previously described as recurrently mutated in BL with at least five affected patients in our discovery cohort. Only non-synonymous simple somatic mutations and copy number variations (minimum size 10 kbp) were considered. To avoid considering mainly large-scale events, copy number variations affecting a BLG were required to be relatively small with a median size of 10 Mbp or less. For each BLG, additional cryptic splicing variants (with support for aberrant splicing in RNA-seq data), structural variations, and copy number variations were manually curated.

McNemar’s tests 

Discordant cases were defined as EBV-negative endemic BL cases and EBV-positive sporadic BL cases. Differentially mutated genes and pathways (referred to here as features) were identified using the following criteria: (1) they must be mutated in at least 10% of cases, and (2) they were differentially mutated between EBV-positive endemic BL cases and EBV-negative sporadic BL cases (Q-value < 0.1, Fisher’s exact test). Discordant cases were excluded from the Fisher’s exact tests to ensure that there is no reason to believe a priori that the mutation status of these features are preferentially associated with tumor EBV status or clinical variant status. Following that, tumor EBV status and clinical variant status were used as naive predictors of the mutation status of these differentially mutated features and determined whether or not they were correct for each case. The performance of tumor EBV status and clinical variant status as predictors were compared using McNemar’s tests. Features with a significant difference according to the McNemar’s test (P-value < 0.05) indicate that the “winning” predictor is more strongly associated with the mutation status of said features.

Non-coding mutation peaks 

Genomic loci enriched in non-coding mutations, referred to here as “mutation peaks”, were identified using the previously described Rainstorm analysis with some adjustments to peak filtering and post-processing. Mutation peaks with a signal-to-noise ratio of 0 were ignored. Following this, peaks with a signal-to-noise ratio below the 95th percentile were also omitted. Mutation peaks were extended on each side by considering the inclusion of nearby mutated positions. Specifically, peaks were extended as follows: (1) for each mutated position within 10 kbp, the peak was extended up to and including said position; (2) for each of the original and extended peaks, the mutation rate (mutations/kbp) was calculated, resulting in a vector of mutation rates equal in length to the number of mutations within 10 kbp + 1 (for the original peak); (3) for each extended peak, the proportional change in mutation rate was calculated by taking the difference in mutation rate with the previous peak (i.e. the peak that omits the outermost mutated position) and dividing this difference by the mutation rate of the previous peak; (4) mutation peaks were extended up to and including the mutated position that preceded the mutated position that led to the largest proportional decrease in mutation rate. Following peak extension, peaks were only considered if they met the following criteria: five or more cases had mutations in the peak; the mutation rate was greater than or equal to 5 mutations/kbp; and the peak size was smaller than or equal to 20 kbp. Lastly, to remove mutation peaks likely caused by noise, peaks within 1 Mbp of a gap in the reference genome were ignored. Gaps were obtained from the University of California, Santa Cruz (UCSC) Table Browser, consisting of centromeres, heterochromatin regions and short chromosomal arms. Each gap was expanded by 100 kbp on both sides, and gaps within 3 Mbp of one another were merged. The remaining peaks were annotated with the nearest protein-coding, long non-coding RNA or microRNA gene. Peaks that overlapped one of the immunoglobulin heavy or light chain loci, namely IGH (chr14:104589639-107810399), IGK (chr2:87999518-90599757), or IGL (chr22:21031465-23905532), were annotated as such. Finally, peaks were labeled according to the immunoglobulin locus they overlapped or the nearest gene and their position relative to the gene’s transcription start site.

P-values were empirically determined for each peak by comparing its mutation rate with an empirical distribution produced by calculating the mutation rates of identically sized regions randomly sampled across the genome. The smallest and largest mutated position on each chromosome were used to determine the range of positions available for sampling with replacement. Positions that overlapped gaps in the reference genome such as centromeres and telomeres were excluded. A “pseudo-peak” was created from a sampled position by extending each side to create regions with the same size as the given mutation peak. The mutation rate of 100,000 such pseudo-peaks was calculated to generate the empirical null distribution of mutation rates genome-wide. The empirical P-value was calculated as the number of pseudo-peaks with a higher mutation rate than the given mutation peak divided by 100,000. Given that each mutation peak is tested against independent null distributions, the P-values did not require multiple test correction. All peaks were significant with empirical P-values < 0.001.

Enrichment for AICDA-mediated mutations 

A custom in-house Python (version 3.6.1) program was used to determine whether certain regions, such as significantly mutated genes and non-coding mutation peaks, were enriched for SNVs and indels that are presumed to be caused by AICDA-mediated mutagenesis. Enrichment for putative AICDA-mediated mutations in a given region was measured using two binomial exact tests. First, the observed number of mutations affecting AICDA recognition sites (number of successes), defined as regions that fit the AICDA motif (RGYW), was compared to the expected number of such mutations, which was calculated from the region’s mutation rate (probability of success) and the number of bases that overlap AICDA recognition sites (number of trials). Second, the observed number of mutations affecting the guanine-cytosine pair targeted by AICDA in the motif (number of successes) was compared to the expected number of such mutations, which was calculated from the region’s mutation rate of guanine-cytosine pairs (probability of success) and the number of target guanine-cytosine pairs in AICDA recognition sites (number of trials). Mutation rates were calculated using the effective region size, which is equal to the product of the region size and the cohort size. The effective region size ensures that the observed number of mutations (number of successes) is never higher than the region size (number of trials). Care was taken to avoid double-counting mutations if they overlapped more than one AICDA recognition site. This process was repeated for all regions of interest. The regions for BL-associated genes were based on the transcripts that were affected by non-synonymous as opposed to entire gene bodies. The entire regions of non-coding mutation peaks were considered. The in-house program also annotated mutations based on whether they overlapped an AICDA recognition site.

De novo mutational signatures 

Mutational signatures were discovered using the previously described framework by Alexandrov et al. We summarized somatic SNVs based on their mutational subtype, 5’ context, and 3’ context. This resulted in a mutation catalog matrix of 96 SNV classes for each sample. We performed non-negative matrix factorisation on our mutation catalog to discover mutational signatures within the entire cohort. Signature stability was computed by boostrap resampling over 1000 total iterations (10 iterations in each of 100 cores). The optimal n-signature solution, nopt , which simultaneously maximised signature stability and minimised the Frobenius reconstruction error, was automatically selected,

Equation for optimal n-signature solution

where and S are the vectors containing reconstruction errors and stability of each n-signature solution, and Rn and Sn are the reconstruction error and stability of the n-signature solution. This approach determined that the four-signature solution was optimal. To determine matches to known mutational signatures, cosine similarity metrics were computed against the 30 COSMIC reference mutational signatures. Where more than one signature matched to a single COSMIC signature, the highest similarity match was chosen and the remaining signatures were matched to the next most similar COSMIC signature. For each n-signature solution, the Pearson correlation was calculated between the age at diagnosis for each case and the predicted number of mutations attributable to de novo signatures associated with age (COSMIC reference signatures 1 and 5), taking the maximum correlation if both COSMIC signatures were paired. Similarly, for each n-signature solution, the Pearson correlation was calculated between AICDA expression for each case and the predicted number of mutations attributable to the de novo signature associated with AICDA activity (COSMIC reference signature 9).

Somatic structural variations 

Somatic structural variations (SVs) were detected using the Manta pipeline (version 1.1.0) in paired tumor-normal mode using default parameters with the exception of a minimum somatic score (SOMATICSCORE) of 45 (default 30). In FFPE samples, any inversions smaller than 500 bp were considered noise and ignored. Variant allele fractions were calculated from the reference and alternate allele counts reported in the Manta output VCF file. VCF files were converted to BEDPE format using the svtools vcftobedpe tool (version 0.3.2, commit 6d7b6ec8). SVs that overlapped any of the significantly mutated genes were manually curated for inclusion as non-synonymous mutations. IG-MYC translocations were identified as being any SV that met the following conditions: (1) one breakpoint was near MYC (chr8:126393182-130762146); (2) the breakpoint near MYC was oriented such that exons 2 and 3 are included in the rearrangement; (3) the other breakpoint was near an immunoglobulin heavy or light chain locus, namely IGH (chr14:104589639-107810399), IGK (chr2:87999518-90599757), or IGL (chr22:21031465-23905532); and (4) the highest-scoring translocation was selected in the event of multiple candidate SVs. Tumors where Manta failed to detect a translocation that met the above criteria were manually inspected for such events, which successfully revealed IG-MYC rearrangements in the remaining cases.

Somatic copy number variations 

Sequenza was used to call somatic copy number variations (CNVs) in tumor-normal pairs. Sequenza bam2seqz (parameters: –qlimit 30) generated the SEQZ files, which were then binned using Sequenza seqz-binning (parameters: -w 300 -s). To eliminate noise, germline heterozygous positions were filtered for any that overlap dbSNP (downloaded 2017-04-03) “common all” single nucleotide polymorphisms (SNPs). Using bedtools intersect (parameters: -wa), germline heterozygous positions were removed if they overlapped gaps in the reference genome (e.g. centromeres) or segmental duplications, which were obtained from the UCSC Table Browser. Previously, the segmental duplications were merged if they overlapped one another using bedtools merge, then filtered for a minimum size of 10 kbp, and subsequently merged again using bedtools merge (parameters: -d 10000). The Sequenza R package was used to load the binned SEQZ data, fit a model for cellularity and ploidy, and generate CNV segments. Sequenza was made aware of the sex of each case to properly handle CNVs on the sex chromosomes. To simplify model fitting and avoid incorrect local optima, ploidy and cellularity options were restricted as follows. Ploidy was limited to the range between 1.8 and 2.5. Cellularity was restricted to an estimate of tumor content derived from the variant allele fraction (VAF) of SNVs and indels, defined as twice the VAF corresponding to the first local density maximum below 50%.

Clonal B-cell receptors 

MiXCR (version 2.1.3) was used to identify immunoglobulin heavy and light chain clones from the RNA-seq and WGS data as per the standard pipeline described in their documentation. The MiXCR pipeline was also run on 323 DLBCL tumor samples that underwent a strand-specific poly[A]-selection RNA-seq protocol.[Ennishi_undated-wu] All RNA-seq reads were aligned using “mixcr align” (parameters: -p rna-seq -OallowPartialAlignments=true) while for the WGS data, only reads originating from the immunoglobulin regions (chr2:88668078-90584447, chr14:105548159-107030529, and chr22:21897318-23046831) or unmapped reads were aligned using “mixcr align” (parameters: -p rna-seq -OallowPartialAlignments=true -OvParameters.geneFeatureToAlign=VGeneWithP). Two rounds of contig assembly was performed using “mixcr assemblePartial” followed by clone assembly using “mixcr assemble”. Clones were exported using “mixcr exportClones” (parameters: -o -t) options to exclude any clones with out-of-frame sequences or stop codons. Clonal fraction was calculated for heavy and light chains separately. Dominant clones in the RNA-seq data were defined as having a clonal fraction of at least 30% with a minimum of 30 supporting reads. For the WGS analysis, dominant clones were defined as having the greatest clonal fraction with at least two supporting reads. The top-scoring V, D, J and C genes were selected for each clone when multiple genes were possible.

Last updated: June 15, 2020