Experimental Methods for Burkitt Lymphoma Genome Sequencing Project

On this page researchers can find detailed information describing how CGCI data was generated by genomic platform, including protocols for establishing high-quality nucleic acid samples. The sequencing protocols were performed at British Columbia Cancer (BC Cancer).

CGCI Sample Naming 

CGCI samples are named using a coding system specific to OCG characterization programs. Please use the OCG Sample Codes document to properly discern the metadata reiterated within the sample name. OCG Sample Codes_finalized 05 2017.pdf

Nucleic Acid Sample Processing

CGCI project teams use high-quality RNA and DNA from case-matched tumor and normal tissues to generate comprehensive genomics data. See the Sample Preparation Protocols below for details.

General Methodology Sample Preparation Protocols
DNA/RNA AllPrep Kit (Qiagen), mirVana miRNA Isolation Kit (Ambion), High Pure miRNA Kit (Roche), and QiaAmp blood midi kit (Qiagen) Burkitt Lymphoma Data Generation Protocol IconBL
Sample Preparation Protocols

The protocol herein describes the procedures used by Nationwide Children's Hospital to process normal and disease tissues for RNA and/or DNA subsequently used for characterization in the NCI’s CGCI initiative.

The experimental protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Sample processing and nucleic acid extraction

Frozen specimens were shipped to and from Nationwide Children’s Hospital (Columbus, OH) using a cryoport that maintained an average temperature of less than -180°C (SOP #308). A top and bottom histologic section were cut from tumor and uninvolved tissue (if it was to be used for healthy tissue control) for pathologic quality control review. These were either stained with H&Eor Wright-Giemsa and imaged at 40X using an Aperio AT Turbo or Aperio AT2 scanner. Images were reviewed by a board-certified pathologist to confirm that the tumor specimen was histologically consistent with Burkitt lymphoma, and that uninvolved specimens contained no tumor cells. The tumor sections were required to contain a minimum of 50% tumor cell nuclei, and less than 50% necrosis for inclusion in the study. Nearly all samples had less than 20% necrosis. RNA and DNA were extracted from frozen (SOP #305) and FFPE tumor (SOP #315-316) and normal tissue specimens (mainly blood or granulocytes) using a modification of the DNA/RNA AllPrep kit (Qiagen). Frozen samples were homogenized and applied to a Qiagen DNA column, and FFPE samples were deparaffinized and applied to a Qiagen FFPE DNA column. The flow through from the Qiagen DNA column was processed using a mirVana miRNA Isolation Kit (Ambion) for frozen tissues, and a High Pure miRNA Kit (Roche) for FFPE tissues. This latter step generated RNA preparations that included RNA <200 nt suitable for miRNA analysis. DNA was extracted from blood using the QiaAmp blood midi kit (Qiagen; SOP #307). DNA was quantified by PicoGreen assay and was resolved by 1% agarose gel electrophoresis to confirm high molecular weight fragments. A custom Sequenom SNP panel or the AmpF/STR Identifiler (Applied Biosystems) was utilized to verify tumor DNA and germline DNA were derived from the same patient. One hundred nanograms of each tumor and normal DNA were sent in duplicate to Qiagen for REPLI-g whole genome amplification using a 100 μg reaction scale. RNA was quantified by measuring Abs260 with a UV spectrophotometer, and integrity was measured using the RNA6000 nano assay (Agilent) to determine the RNA Integrity Number (RIN) for frozen samples or DV200 for FFPE samples. For inclusion in the discovery set, a tumor needed to pass pathology consensus review (University of Nebraska Medical Center, Omaha, NE) and the specimen pathology quality control review (Nationwide Children’s Hospital, Columbus, OH). In addition, a primary tumor and a matched germline (blood, buccal, or uninvolved tissue) sample needed to pass the following metrics: a minimum of 0.7 μg of DNA from frozen or 0.25 μg of DNA from FFPE, and 3 μg RNA from frozen or 1 μg RNA from FFPE. The minimum RNA integrity metrics were a RIN above 7.0 or DV200 above 30. Cases that did not meet these metrics were included in the validation set if there was at least 0.7 μg of DNA from the primary tumor available for DNA sequencing. Tumor RNA sequencing was also performed for validation cases if there was sufficient RNA material.

Expand All

Last updated: June 23, 2020
Sample Preparation Protocols
Sample Preparation and Nucleic acid extraction

Whole Genome Sequencing

Data Generation Protocols Data Analysis Protocols
Data Generation Protocols Data Analysis Protocols

Data Generation Protocols

The data generation protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Whole genome sequencing of fresh frozen samples

Whole genome sequencing (WGS) libraries were constructed from DNA provided by Nationwide Children’s Hospital (Columbus, OH) using a polymerase chain reaction (PCR)-free protocol. To minimize library bias and coverage gaps associated with PCR amplification of high GC or AT-rich regions we have implemented a version of the TruSeq DNA PCR-free kit (E6875-6877B-GSC, New England Biolabs), automated on a Microlab NIMBUS liquid handling robot (Hamilton). Briefly, 500 ng of genomic DNA was arrayed in a 96-well microtitre plate and subjected to shearing by sonication (Covaris LE220). Sheared DNA was end-repaired and size selected using paramagnetic PCRClean DX beads (C-1003-450, Aline Biosciences) targeting a 300-400 bp fraction. After 3’ A-tailing, full length TruSeq adapters were ligated. Libraries were purified using paramagnetic (Aline Biosciences) beads. PCR-free genome library concentrations were quantified using a qPCR Library Quantification kit (KAPA, KK4824) prior to sequencing with paired-end 150 base reads on the Illumina HiSeqX platform using V4 chemistry according to manufacturer recommendations.

Whole genome sequencing of formalin-fixed, paraffin-embedded samples 

A 96-well library construction protocol was performed from formalin-fixed and paraffin-embedded (FFPE) tissue extracted genomic DNA provided by Nationwide Children’s Hospital (Columbus, OH). Since DNA extracted from FFPE tissue will be damaged by the fixation process and prolonged storage in non-ideal conditions, variable DNA quality across the collection is expected with some highly degraded samples. DNA was normalized to 500 ng in a volume of 62 μL elution buffer (Qiagen) and transferred into a microTUBE plate for shearing on an LE220 (Covaris) acoustic sonicator using the conditions: Duty Factor, 20%; Peak Incident Power, 450W; Cycle per burst, 200; Duration, 2 x 60 seconds with an intervening spin. The profile of sheared FFPE DNA extracted by the Qiagen Allprep DNA/RNA FFPE protocol has a dominant DNA peak in the size range between 300 and 400bp. To improve library quality of FFPE-derived DNA, solid phase reversible immobilization (SPRI) bead-based size selection was performed before library construction to remove smaller DNA fragments from highly degraded FFPE DNAs. If not removed early in the library construction process, these smaller fragments would otherwise dominate the final amplified library. FFPE DNA damage and end-repair and phosphorylation were combined in a single reaction using an enzymatic premix (NEB), then bead purified using a 0.8:1 (bead: sample) ratio to remove small FFPE fragments. Repaired DNA fragments were next A-tailed for ligation to paired-end, partial Illumina sequencing adapters then purified twice with SPRI beads (1:1 ratio). Full-length adaptered products were achieved by performing 8 cycles PCR with primers introducing fault-tolerant hexamer “barcodes” allowing multiplexing of libraries. Indexed PCR products were double purified with 1:1 beads. Concentration of final libraries was determined using size profiles obtained from a high sensitivity Caliper LabChip GX together with Quant-iT (Invitrogen) quantification.

Experimental Protocols

To request more information or approval regarding the following protocol, please contact BC Cancer at labqa@bcgsc.ca.

96-well PCR Free Library Construction on NIMBUS for Illumina Sequencing

 

Data Analysis Protocols

The data analysis protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Simple somatic mutations 

Somatic single nucleotide variants (SNVs) and small insertions/deletions (indels), also known as simple somatic mutations (SSMs), were called from paired tumor-normal WGS data using the Strelka workflow (version 1.0.14). The default Strelka configuration for data aligned with bwa (strelka_config_bwa_default.ini) was used with the exception of filtering SNVs with a minimum QSS of 25 (default 15). For SNVs and indels, reference and alternate allele counts were taken from the Strelka output variant call format (VCF) file. SNVs and indels were annotated using vcf2maf (version 1.6.12) and Ensembl Variant Effect Predictor (release 86). Transcript selection for annotation was performed by vcf2maf with the following exception. Noncanonical transcripts were instead selected if they were non-synonymously mutated more commonly than the canonical transcript (minimum increase of two affected cases). SNVs and indels were further filtered for a minimum alternate allele count of six and a minimum variant allele fraction (VAF) of 10% and 20% for fresh frozen (FF) and formalin-fixed, paraffin-embedded (FFPE) tumors, respectively. Tumors with a median VAF below 25% were omitted from subsequent analyses due to either excessive noise or low predicted tumor content. The same pipeline was used for detecting SNVs and indels in the targeted validation sequencing data, with the exception that depth filters were disabled for Strelka (isSkipDepthFilters = 1).

Significantly mutated genes 

Considering only SNVs and indels, significantly mutated genes were identified using an ensemble approach integrating four methods: MutSigCV, OncodriveFM, OncodriveFML, and OncodriveCLUST. Mutations were lifted over from GRCh38 to GRCh37 using CrossMap (version 0.2.5) along with the “hg38ToHg19” chain file provided by the UCSC Genome Browser. Lifting over variants was necessary because some of the methods listed above rely on GRCh37 reference data. For consistency, the lifted-over mutations based on GRCh37 served as input for all methods. Non-synonymous mutations were defined as those with one of the following values in the MAF file Variant_Classification field, as annotated by vcf2maf: Splice_Site, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, Nonstop_Mutation, Translation_Start_Site, In_Frame_Ins, In_Frame_Del, or Missense_Mutation. To minimize noise, we only considered genes deemed significant (Q-value < 0.1) by two or more methods.  

BL-associated genes 

We defined BL-associated genes (BLGs) as any gene deemed significantly mutated in this study or previously described as recurrently mutated in BL with at least five affected patients in our discovery cohort. Only non-synonymous simple somatic mutations and copy number variations (minimum size 10 kbp) were considered. To avoid considering mainly large-scale events, copy number variations affecting a BLG were required to be relatively small with a median size of 10 Mbp or less. For each BLG, additional cryptic splicing variants (with support for aberrant splicing in RNA-seq data), structural variations, and copy number variations were manually curated.

McNemar’s tests 

Discordant cases were defined as EBV-negative endemic BL cases and EBV-positive sporadic BL cases. Differentially mutated genes and pathways (referred to here as features) were identified using the following criteria: (1) they must be mutated in at least 10% of cases, and (2) they were differentially mutated between EBV-positive endemic BL cases and EBV-negative sporadic BL cases (Q-value < 0.1, Fisher’s exact test). Discordant cases were excluded from the Fisher’s exact tests to ensure that there is no reason to believe a priori that the mutation status of these features are preferentially associated with tumor EBV status or clinical variant status. Following that, tumor EBV status and clinical variant status were used as naive predictors of the mutation status of these differentially mutated features and determined whether or not they were correct for each case. The performance of tumor EBV status and clinical variant status as predictors were compared using McNemar’s tests. Features with a significant difference according to the McNemar’s test (P-value < 0.05) indicate that the “winning” predictor is more strongly associated with the mutation status of said features.

Non-coding mutation peaks 

Genomic loci enriched in non-coding mutations, referred to here as “mutation peaks”, were identified using the previously described Rainstorm analysis with some adjustments to peak filtering and post-processing. Mutation peaks with a signal-to-noise ratio of 0 were ignored. Following this, peaks with a signal-to-noise ratio below the 95th percentile were also omitted. Mutation peaks were extended on each side by considering the inclusion of nearby mutated positions. Specifically, peaks were extended as follows: (1) for each mutated position within 10 kbp, the peak was extended up to and including said position; (2) for each of the original and extended peaks, the mutation rate (mutations/kbp) was calculated, resulting in a vector of mutation rates equal in length to the number of mutations within 10 kbp + 1 (for the original peak); (3) for each extended peak, the proportional change in mutation rate was calculated by taking the difference in mutation rate with the previous peak (i.e. the peak that omits the outermost mutated position) and dividing this difference by the mutation rate of the previous peak; (4) mutation peaks were extended up to and including the mutated position that preceded the mutated position that led to the largest proportional decrease in mutation rate. Following peak extension, peaks were only considered if they met the following criteria: five or more cases had mutations in the peak; the mutation rate was greater than or equal to 5 mutations/kbp; and the peak size was smaller than or equal to 20 kbp. Lastly, to remove mutation peaks likely caused by noise, peaks within 1 Mbp of a gap in the reference genome were ignored. Gaps were obtained from the University of California, Santa Cruz (UCSC) Table Browser, consisting of centromeres, heterochromatin regions and short chromosomal arms. Each gap was expanded by 100 kbp on both sides, and gaps within 3 Mbp of one another were merged. The remaining peaks were annotated with the nearest protein-coding, long non-coding RNA or microRNA gene. Peaks that overlapped one of the immunoglobulin heavy or light chain loci, namely IGH (chr14:104589639-107810399), IGK (chr2:87999518-90599757), or IGL (chr22:21031465-23905532), were annotated as such. Finally, peaks were labeled according to the immunoglobulin locus they overlapped or the nearest gene and their position relative to the gene’s transcription start site.

P-values were empirically determined for each peak by comparing its mutation rate with an empirical distribution produced by calculating the mutation rates of identically sized regions randomly sampled across the genome. The smallest and largest mutated position on each chromosome were used to determine the range of positions available for sampling with replacement. Positions that overlapped gaps in the reference genome such as centromeres and telomeres were excluded. A “pseudo-peak” was created from a sampled position by extending each side to create regions with the same size as the given mutation peak. The mutation rate of 100,000 such pseudo-peaks was calculated to generate the empirical null distribution of mutation rates genome-wide. The empirical P-value was calculated as the number of pseudo-peaks with a higher mutation rate than the given mutation peak divided by 100,000. Given that each mutation peak is tested against independent null distributions, the P-values did not require multiple test correction. All peaks were significant with empirical P-values < 0.001.

Enrichment for AICDA-mediated mutations 

A custom in-house Python (version 3.6.1) program was used to determine whether certain regions, such as significantly mutated genes and non-coding mutation peaks, were enriched for SNVs and indels that are presumed to be caused by AICDA-mediated mutagenesis. Enrichment for putative AICDA-mediated mutations in a given region was measured using two binomial exact tests. First, the observed number of mutations affecting AICDA recognition sites (number of successes), defined as regions that fit the AICDA motif (RGYW), was compared to the expected number of such mutations, which was calculated from the region’s mutation rate (probability of success) and the number of bases that overlap AICDA recognition sites (number of trials). Second, the observed number of mutations affecting the guanine-cytosine pair targeted by AICDA in the motif (number of successes) was compared to the expected number of such mutations, which was calculated from the region’s mutation rate of guanine-cytosine pairs (probability of success) and the number of target guanine-cytosine pairs in AICDA recognition sites (number of trials). Mutation rates were calculated using the effective region size, which is equal to the product of the region size and the cohort size. The effective region size ensures that the observed number of mutations (number of successes) is never higher than the region size (number of trials). Care was taken to avoid double-counting mutations if they overlapped more than one AICDA recognition site. This process was repeated for all regions of interest. The regions for BL-associated genes were based on the transcripts that were affected by non-synonymous as opposed to entire gene bodies. The entire regions of non-coding mutation peaks were considered. The in-house program also annotated mutations based on whether they overlapped an AICDA recognition site.

De novo mutational signatures 

Mutational signatures were discovered using the previously described framework by Alexandrov et al. We summarized somatic SNVs based on their mutational subtype, 5’ context, and 3’ context. This resulted in a mutation catalog matrix of 96 SNV classes for each sample. We performed non-negative matrix factorisation on our mutation catalog to discover mutational signatures within the entire cohort. Signature stability was computed by boostrap resampling over 1000 total iterations (10 iterations in each of 100 cores). The optimal n-signature solution, nopt , which simultaneously maximised signature stability and minimised the Frobenius reconstruction error, was automatically selected,

Equation for optimal n-signature solution

where and S are the vectors containing reconstruction errors and stability of each n-signature solution, and Rn and Sn are the reconstruction error and stability of the n-signature solution. This approach determined that the four-signature solution was optimal. To determine matches to known mutational signatures, cosine similarity metrics were computed against the 30 COSMIC reference mutational signatures. Where more than one signature matched to a single COSMIC signature, the highest similarity match was chosen and the remaining signatures were matched to the next most similar COSMIC signature. For each n-signature solution, the Pearson correlation was calculated between the age at diagnosis for each case and the predicted number of mutations attributable to de novo signatures associated with age (COSMIC reference signatures 1 and 5), taking the maximum correlation if both COSMIC signatures were paired. Similarly, for each n-signature solution, the Pearson correlation was calculated between AICDA expression for each case and the predicted number of mutations attributable to the de novo signature associated with AICDA activity (COSMIC reference signature 9).

Somatic structural variations 

Somatic structural variations (SVs) were detected using the Manta pipeline (version 1.1.0) in paired tumor-normal mode using default parameters with the exception of a minimum somatic score (SOMATICSCORE) of 45 (default 30). In FFPE samples, any inversions smaller than 500 bp were considered noise and ignored. Variant allele fractions were calculated from the reference and alternate allele counts reported in the Manta output VCF file. VCF files were converted to BEDPE format using the svtools vcftobedpe tool (version 0.3.2, commit 6d7b6ec8). SVs that overlapped any of the significantly mutated genes were manually curated for inclusion as non-synonymous mutations. IG-MYC translocations were identified as being any SV that met the following conditions: (1) one breakpoint was near MYC (chr8:126393182-130762146); (2) the breakpoint near MYC was oriented such that exons 2 and 3 are included in the rearrangement; (3) the other breakpoint was near an immunoglobulin heavy or light chain locus, namely IGH (chr14:104589639-107810399), IGK (chr2:87999518-90599757), or IGL (chr22:21031465-23905532); and (4) the highest-scoring translocation was selected in the event of multiple candidate SVs. Tumors where Manta failed to detect a translocation that met the above criteria were manually inspected for such events, which successfully revealed IG-MYC rearrangements in the remaining cases.

Somatic copy number variations 

Sequenza was used to call somatic copy number variations (CNVs) in tumor-normal pairs. Sequenza bam2seqz (parameters: –qlimit 30) generated the SEQZ files, which were then binned using Sequenza seqz-binning (parameters: -w 300 -s). To eliminate noise, germline heterozygous positions were filtered for any that overlap dbSNP (downloaded 2017-04-03) “common all” single nucleotide polymorphisms (SNPs). Using bedtools intersect (parameters: -wa), germline heterozygous positions were removed if they overlapped gaps in the reference genome (e.g. centromeres) or segmental duplications, which were obtained from the UCSC Table Browser. Previously, the segmental duplications were merged if they overlapped one another using bedtools merge, then filtered for a minimum size of 10 kbp, and subsequently merged again using bedtools merge (parameters: -d 10000). The Sequenza R package was used to load the binned SEQZ data, fit a model for cellularity and ploidy, and generate CNV segments. Sequenza was made aware of the sex of each case to properly handle CNVs on the sex chromosomes. To simplify model fitting and avoid incorrect local optima, ploidy and cellularity options were restricted as follows. Ploidy was limited to the range between 1.8 and 2.5. Cellularity was restricted to an estimate of tumor content derived from the variant allele fraction (VAF) of SNVs and indels, defined as twice the VAF corresponding to the first local density maximum below 50%.

Clonal B-cell receptors 

MiXCR (version 2.1.3) was used to identify immunoglobulin heavy and light chain clones from the RNA-seq and WGS data as per the standard pipeline described in their documentation. The MiXCR pipeline was also run on 323 DLBCL tumor samples that underwent a strand-specific poly[A]-selection RNA-seq protocol.[Ennishi_undated-wu] All RNA-seq reads were aligned using “mixcr align” (parameters: -p rna-seq -OallowPartialAlignments=true) while for the WGS data, only reads originating from the immunoglobulin regions (chr2:88668078-90584447, chr14:105548159-107030529, and chr22:21897318-23046831) or unmapped reads were aligned using “mixcr align” (parameters: -p rna-seq -OallowPartialAlignments=true -OvParameters.geneFeatureToAlign=VGeneWithP). Two rounds of contig assembly was performed using “mixcr assemblePartial” followed by clone assembly using “mixcr assemble”. Clones were exported using “mixcr exportClones” (parameters: -o -t) options to exclude any clones with out-of-frame sequences or stop codons. Clonal fraction was calculated for heavy and light chains separately. Dominant clones in the RNA-seq data were defined as having a clonal fraction of at least 30% with a minimum of 30 supporting reads. For the WGS analysis, dominant clones were defined as having the greatest clonal fraction with at least two supporting reads. The top-scoring V, D, J and C genes were selected for each clone when multiple genes were possible.

Transcriptome Sequencing

Data Generation Protocols Data Analysis Protocols
Data Generation Protocols Data Analysis Protocols

Data Generation Protocols

The data generation protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Strand-specific ribosomal RNA depletion RNA sequencing 

RNA sequencing (RNA-seq) libraries were constructed from RNA provided by Nationwide Children’s Hospital (Columbus, OH) using a strand-specific ribosomal depletion protocol. To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X). Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95°C for 2 minutes followed by incremental reduction in temperature by 0.1°C per second to 22°C (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37°C for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37°C for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36 uL DEPC water. The plate containing RNA was stored at -80°C prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130 seconds (2 x 65 seconds) at a “Duty cycle” of 30%, 450W Peak Incident Power and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20°C for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37°C for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20°C for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37°C for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98°C for 1 minute followed by 13 cycles of 98°C 15 seconds, 65°C 30 seconds and 72°C 30 seconds, and then 72°C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

miRNA Sequencing

miRNA sequencing (miRNA-seq) libraries were constructed from 1 μg total RNA provided by Nationwide Children’s Hospital (Columbus, OH) using a plate-based protocol developed at the British Columbia Cancer, Genome Sciences Centre (BCGSC). Negative controls were added at three stages: elution buffer was added to one well when the total RNA was loaded onto the plate, water to another well just before ligating the 3’ adapter, and PCR brew mix to a final well just before PCR amplification. A 3’ adapter was ligated using a truncated T4 RNA ligase2 (NEB 212 Canada, cat. M0242L) with an incubation at 22°C for 1 hour. This adapter is an adenylated, single-stranded DNA with the sequence 5’ /5rApp/ ATCTCGTATGCCGTCTTCTGCTTGT/3ddC/, which selectively ligates to miRNAs. An RNA 5’ adapter was then ligated, using T4 RNA ligase (Ambion USA, cat. AM2141) and ATP, and was incubated at 37°C for 1 hour. The sequence of the single strand RNA adapter is 5’GUUCAGAGUUCUACAGUCCGACGAUCUGGUCAA3’.

Upon completion of adapter ligation, 1st strand cDNA was synthesized using Superscript II Reverse Transcriptase (Invitrogen, cat.18064 014) and RT primer (5’-221 CAAGCAGAAGACGGCATACGAGAT-3’). First-strand cDNA provided the template for the final library PCR, into which we introduce index sequences to enable libraries to be identified from a sequenced pool that contains multiple libraries. Briefly, a PCR brew mix was made with the 3’ PCR primer (5’-CAAGCAGAAGACGGCATACGAGAT-3’), Phusion Hot Start HighFidelity DNA polymerase (NEB Canada, cat. F-540L), buffer, dNTPs and DMSO. The mix was distributed evenly into a new 96-well plate. A Microlab NIMBUS robot (Hamilton Robotics, USA) was used to transfer the PCR template (1st strand cDNA) and indexed 5’ PCR primers into the brew mix plate. Each indexed 5’ PCR primer, 5’AATGATACGGCGACCACCGACAGNNNNNNGTTCAGAGTTCTACAGTCCGA-3’,contains a unique six-nucleotide ‘index’ (shown here as N’s), and was added to each well of the 96-well PCR brew plate. PCR was performed at 98°C for 30 seconds, followed by 15 cycles of 98°C for 15 seconds, 62°C for 30 seconds and 72°C for 15 seconds, and finally a 5 minute incubation at 72°C. Library qualities were assessed across the whole plate using a Caliper LabChipGX DNA chip. PCR products were pooled and size selected to remove larger cDNA fragments and smaller adapter contaminants, using a 96-channel 235 automated size selection robot that was developed at the BCGSC. After size selection, each pool was ethanol precipitated, quality checked using an Agilent Bioanalyzer DNA1000 chip and quantified using a Qubit fluorometer (Invitrogen, cat. Q32854). Each pool was diluted to a target concentration for cluster generation and loaded into a single lane of an Illumina HiSeq2500 flow cell. Clusters were generated, and lanes were sequenced with a 31-bp main read for the insert and a 7-bp read for the index.

Experimental Protocols

To request more information or approval regarding the following protocols, please contact BC Cancer at labqa@bcgsc.ca.

96-well Plate-based Strand-specific cDNA Synthesis using Maxima H Minus on Hamilton NIMBUS

Nimbus-assisted 96-well PCR-enriched Library Construction for Illumina Sequencing

Plate-based rRNA depletion

miRNA3 – Plate Format miRNA Library Construction

 

Data Analysis Protocol

The data analysis protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Gene expression quantification 

The tximport Bioconductor R package was used to summarize transcript-level read counts at the gene level. The DESeq2 Bioconductor R package was used to correct the read counts for library size and to perform a variance-stabilizing data transformation. These variance-stabilized expression values were used for statistical tests that require homoskedastic data.

Targeted Sequencing

Data Generation Protocols Data Analysis Protocols
Data Generation Protocols Data Analysis Protocols

Data Generation Protocol

The data generation protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Targeted sequencing by custom hybridization capture

Targeted sequencing libraries were constructed from DNA provided by Nationwide Children’s Hospital (Columbus, OH) using a custom hybridization capture protocol. 50 ng from each of 20 or 21 whole genome libraries was pooled prior to custom capture using Agilent SureSelect XT Custom probes (4.8 Mbp) targeting 74,809 human and EBV features (https://cgci-data.nci.nih.gov/PreRelease/BLGSP/targeted_capture_sequencing/DESIGN/). The features included the following: exons of recurrently mutated genes with the exception of known targets of passenger mutations (e.g. TTN, mucin genes); exons of several known diffuse large B-cell lymphoma (DLBCL) genes; exons of previously reported BL genes not found mutated in our data; whole gene bodies for DDX3X (GRCh38 chrX:41332775-41364961) and FBXO11 (chr2:47782639-47907718); whole gene bodies and flanking regions for ID3 (chr1:23557918-23657826) and BCL6 (chr3:187718649-188265924); the recurrently rearranged region surrounding MYC (chr8:127242368-129788153); and non-coding mutation peaks (details below). The pooled libraries were hybridized to the RNA probes at 65°C for 24 hours. Following hybridization, streptavidin-coated magnetic beads (Dynal, MyOne) were used for custom capture. Post-capture material was purified on MinElute columns (Qiagen) followed by post-capture enrichment with 10 cycles of PCR using primers that maintain the library-specific indices. Pooled libraries were sequenced on an Illumina HiSeq 2500 instruments with v4 chemistry generating 125 base paired-end reads.

Experimental Protocols

To request more information or approval regarding the following protocol, please contact BC Cancer at labqa@bcgsc.ca.

Multiplex Agilent Whole Exome or Target Capture

 

Data Analysis Protocol

The data analysis protocols for the Burkitt Lymphoma (BL) project were acquired from the following manuscript.

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Sequencing read alignment

Whole genome sequencing (WGS) and targeted sequencing reads were aligned to the human reference genome (GRCh38) with BWA-MEM (version 0.7.6a; parameters: -M). The human reference genome that was used is a version of GRCh38 without alternate contigs that includes the Epstein–Barr viral genome (GenBank accession AJ507799.2), which can be downloaded at http://www.bcgsc.ca/downloads/genomes/9606/hg38_no_alt/bwa_0.7.6a_ind/genome/. Read duplicate marking was done using sambamba (version 0.5.5). The WGS read alignments for the discovery tumor and matched normal had an average non-redundant depth of 82X (range 55-96) and 41X (range 30-51), respectively. The validation tumor and normal samples were sequenced to a higher average depth, namely 243X (range 158-392). Tumor and matched normal WGS data for 15 cases from the International Cancer Genomic Consortium (ICGC) were obtained through a DACO-approved project using a virtual instance on the Cancer Genome Collaboratory. The ICGC WGS reads were re-aligned using the above parameters, yielding alignments with an average depth of 40X (range 29-62). RNA sequencing (RNA-seq) reads (mean 200M reads; range 100-289M) were pseudo-aligned using Salmon (version 0.8.2; details below). The RNA-seq reads were also aligned to the reference genome indicated above using the JAGuaR pipeline. miRNA sequencing (miRNA-seq) reads (mean 13M reads; range 1.8-35M) were aligned to the same human reference genome with BWA-SW (version 0.5.7).

Sequencing alignment bed files can be found here.

Analysis of Epstein-Barr Virus in Pediatric Burkitt Lymphoma

Data Analysis Protocols
Data Analysis Protocol

The Epstein-Barr virus (EBV) sequences are available for download as BAM alignments from the public directory at the OCG Data Coordinating Center: https://cgci-data.nci.nih.gov/Public/BLGSP/WGS/L2/. These 106 open access EBV BAM files were extracted from the BLGSP patient genomes included in the following publication:

Grande BM, Gerhard DS, Jiang A, et al. Genome-wide discovery of somatic coding and non-coding mutations in pediatric endemic and sporadic Burkitt lymphoma. Blood. March 2019; 21;133(12):1313-1324. (PMID: 30617194)

Tumor EBV status and genome type 

Tumor EBV status and genome type was inferred from tumor whole genome sequencing (WGS) and RNA sequencing (RNA-seq) data. To determine tumor EBV status, the fraction of reads aligning to the EBV genome was calculated using Samtools (version 1.6). Tumors were considered to be EBV-positive when the EBV fraction of WGS reads was greater than 0.00006 (calculated from the fraction represented by the EBV genome in the reference genome) and the number of RNA-seq reads mapped to the EBER1 (chrEBV:6629-6795) and EBER2 (chrEBV:6956-7128) loci in the JAGuaR-based alignments was greater than 250. There were no cases with discordant EBV statuses inferred from the WGS and RNA-seq data. Although EBER expression was not quantified for the ICGC tumors because their RNA-seq data were not used in this project, they were all classified as EBV-negative according to their WGS data, which is consistent with the EBV status reported by the MMML-seq project. The minimum fraction of EBV reads was 0.01 for samples that underwent targeted sequencing to account for the different ratio of human and EBV genomic regions due to hybridization capture. EBV genome type was inferred for EBV-positive tumors by comparing the counts for 21-mers that are unique to either EBV type 1 (GenBank accession NC_007605.1) or type 2 (GenBank accession NC_009334.1). K-mer counting was performing on tumor WGS reads aligned to the EBV genome using Jellyfish (version 2.2.6). EBV genome type was inferred to be type 1 or type 2 if the count ratio of EBV type 1–specific k-mers to EBV type 2–specific k-mers was greater than or lesser than 1, respectively.