mRNA Sequencing

Sequencing Center Data Generation Protocols Data Analysis Protocols
British Columbia Cancer Agency (BCCA) ALLALL symbol , AMLAML symbol , NBLNeuroblastoma symbol , RTKidney tumor symbol , WTKidney tumor symbol , ALALALL symbol ALLALL symbol , AMLAML symbol , NBLNeuroblastoma symbol , RTKidney tumor symbol , WTKidney tumor symbol , ALALALL symbol
NCI Center for Cancer Research CCSKKidney tumor symbol , NBLNeuroblastoma symbol CCSKKidney tumor symbol , NBLNeuroblastoma symbol
St. Jude Children’s Research Hospital (SJCRH) ALALALL symbol

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-seq lite library construction (AML-IF):

For each sample, approximately 10ng of total RNA was processed using the SMART(TM) cDNA synthesis protocol including SMARTScribe Reverse Transcriptase (Clontech, #639536). This method deploys a modified oligo(dT) primer to prime the first strand synthesis reaction and a template switching mechanism to generate full-length single-stranded cDNAs containing the complete 5’ end of the mRNA as well as universal priming sequences for end-to-end amplification during 20 cycles of PCR. The amplified cDNA was subject to Illumina paired-end library construction using NEBNext paired-end DNA sample Prep Kit (NEB, E6000B-25). Libraries were sequenced on Illumina HiSeq2000 instruments.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

RNA-Seq (plate-based) library construction (pre-2014):

2-3 ug total RNA samples were arrayed into a 96-well plate and polyadenylated (PolyA+) RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) with on column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA). Double-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript Double-Stranded cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM. The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The quality was checked for a random sampling on an Agilent Bioanalyzer using the High Sensitivity DNA chip Assay.  cDNA was fragmented by Covaris E210 (Covaris, USA) sonication for 55 seconds, a “Duty cycle” of 20% and “Intensity” of 5. Plate-based libraries were prepared following the BC Cancer Agency, Genome Sciences Centre paired-end (PE) protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailling by Klenow fragment (3’ to 5’ exo minus). After cleanup using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-15 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of desired size range was purified using an in-house 96-channel size selection robot, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final concentration was verified by Quant-iT dsDNA HS Assay prior to Illumina HiSeq2000 PE 75 base sequencing.

Strand-specific RNA-seq (plate based) library construction (post-2014):

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific RNA-seq (plate based) library construction:

Total RNA samples were checked using an Agilent Bioanalyzer RNA nanochip or Caliper GX HT RNA LabChip, and samples passing quality control were arrayed into a 96-well plate. PolyA+ RNA was purified using the 96-well MultiMACS mRNA isolation kit on the MultiMACS 96 separator (Miltenyi Biotec, Germany) from 2ug total RNA with on-column DNaseI-treatment as per the manufacturer's instructions. The eluted PolyA+ RNA was ethanol precipitated and resuspended in 10µL of DEPC treated water with 1:20 SuperaseIN (Life Technologies, USA).

 

First-stranded cDNA was synthesized from the purified polyA+RNA using the Superscript cDNA Synthesis kit (Life Technologies, USA) and random hexamer primers at a concentration of 5µM along with a final concentration of 1ug/uL Actinomycin D, followed by Ampure XP SPRI beads on a Biomek FX robot (Beckman-Coulter, USA). The second strand cDNA was synthesized following the Superscript cDNA Synthesis protocol by replacing the dTTP with dUTP in dNTP mix, allowing second strand to be digested using UNG (Uracil-N-Glycosylase, Life Technologies, USA) in the post-adapter ligation reaction and thus achieving strand specificity.

The cDNA was quantified in a 96-well format using PicoGreen (Life Technologies, USA) and VICTOR3V Spectrophotometer (PerkinElmer, Inc. USA). The cDNA was fragmented by Covaris E210 sonication for 55 seconds at a “Duty cycle” of 20% and “Intensity” of 5. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based and paired-end library construction protocol on a Biomek FX robot (Beckman-Coulter, USA). Briefly, the cDNA was purified in 96-well format using Ampure XP SPRI beads, and was subject to end-repair, and phosphorylation by T4 DNA polymerase, Klenow DNA Polymerase, and T4 polynucleotide kinase respectively in a single reaction, followed by cleanup using Ampure XP SPRI beads and 3’ A-tailing by Klenow fragment (3’ to 5’ exo minus). After purification using Ampure XP SPRI beads, picogreen quantification was performed to determine the amount of Illumina PE adapters to be used in the next step of adapter ligation reaction. The adapter-ligated products were purified using Ampure XP SPRI beads, and digested with UNG (1U/ul) at 37oC for 30 min followed by deactivation at 95oC for 15 min. The digested cDNA was purified using Ampure XP SPRI beads, and then PCR-amplified with Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) using Illumina’s PE primer set,  with cycle condition 98˚C  30sec followed by 10-13 cycles of 98˚C  10 sec, 65˚C  30 sec and 72˚C  30 sec, and then 72˚C  5min. The PCR products were purified using Ampure XP SPRI beads, and checked with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA). PCR product of the desired size range was purified using 8% PAGE, and the DNA quality was assessed and quantified using an Agilent DNA 1000 series II assay and Quant-iT dsDNA HS Assay Kit using Qubit fluorometer (Invitrogen), then diluted to 8nM. The final library concentration was double checked and determined by Quant-iT dsDNA HS Assay again for Illumina Sequencing.

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

RNA-Seq/hg19 read alignment:

Illumina paired-end RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment.

Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

 

Structural variant detection

Was performed using ABySS (v1.3.2) and trans-ABySS (v1.4.6). For RNA-seq assembly alternate k-mers from k50-k96 were performed using positive strand and ambiguous stand reads as well as negative strand and ambiguous strand reads. The positive and negative strand assemblies were extended where possible, merged and then concatenated together to produce a meta-assembly contig dataset. The genome (WGS) libraries were assembled in single end mode using k-mer values of k24, and k44. The contigs and reads were then reassembled at k64 in single end mode and then finally at k64 in paired end mode. The meta-assemblies were then used as input to the trans-ABySS analysis pipeline (Robertson et al., 2010).

Large scale rearrangements and gene fusions from RNA-seq libraries were identified from contigs that had high confidence GMAP (v2012-12-20) alignments to two distinct genomic regions.  Evidence for the alignments were provided from aligning reads back to the contigs and from aligning reads to genomic coordinates. Events were then filtered on read thresholds. Large scale rearrangements and gene fusions from WGS libraries were identified in a similar way, but using BWA (v0.6.2-r126) alignments.

Insertions and deletions were identified by gapped alignment of contigs to the human reference using GMAP for RNA-seq and BWA for WGS. Confidence in the event was calculated from the alignment of reads back to the event breakpoint in the contigs.  The events were then screened against dbSNP and other variation databases to identify putative novel events.

To determine compartment specific events the structural variant calls for each patient from all matched genome and RNA-seq samples were concatenated together and screened against matching genome tumour, and where available germline bam files. This resulted in compartment specific structural variant events and where germline was available putative somatic and germline events. The events were further filtered against a compendium of germline structural variants to remove recurrent false positives.

SNV analysis of strand-specific RNA-seq data:

After repositioning, hg19-aligned BAM files were split into positive-fragment and negative-fragment BAM files based on the orientation of the paired-end reads. Unmapped and improperly paired aligned reads were put into the mix-fragment BAM. SNVs were then detected on positive- and negative-split BAMs separately using SNVMix2 (Goya et al., 2010) with parameters Mb and Q30.  The SNVs were further filtered to exclude those called based on 1) reference base N; 2) only 1 read supports the variant; 3) probability of heterozygous and homozygous of variant allele smaller than 0.99; 4) a position overlapping with insertions or deletions; 5) read supports from positions no more than 5 bases from read ends; 6) supports from reads only spanning an exon-exon junction; 7) more than 0.5 proportion of supporting reads were improper paired; 8) fewer than 2 proper-paired supporting reads.  SNVs located in exons equal or smaller than the read length, 100bp in this case, are a special case, because all their coverage may come from exon-exon junction spanning reads, so we also identified small-exonic SNVs that ware only supported by reads that spanning exon-exon junction but passed all other 7 filtering criteria mentioned above. These SNVs were finally annotated with SnpEff (Cingolani et al., 2012b) (Ensembl 66) and SnpSift (Cingolani et al., 2012a) (dbSNP137 and COSMIC64).

mRNA-Differential expression:

We used SAMseq (samr v2.0, R 2.15.0) two-class unpaired analyses with an FDR threshold of 0.05 to identify genes that were differentially expressed. For each run on a pair of sample groups, we first reduced the number of genes by removing those with median less than 5 RPKM in both groups, and those for which the Wilcoxon BH adjusted P-value between the two groups was greater than 0.05. This subset of genes was submitted to SAMseq. Each run generated a pair of files: genes ‘up’ and ‘down’. We then ranked the genes by a median-based fold change, and generated a figure showing up to 10 of the largest fold changes in each direction.

mRNA-NMF:

For specific mRNA-Seq expression datasets, we first removed genes expressed at or below a noise threshold of ≤ 0.2 reads per kilobase (of gene model) per million mapped reads (RPKM) in at least 75% of samples. We created the NMF input matrix using the top 25% most-variant genes, by ranking expressed genes having a mean RPKM of at least 10 by the coefficient of variation. We generated consensus clustering results with NMF v0.5.02 in R v1.12.0, with the default Brunet algorithm, and 200 iterations for the clustering run. Rank survey profiles for cophenetic and silhouette width suggest a specific cluster solution.

Strand-specific ribodepletion RNA sequencing:

To remove cytoplasmic and mitochondrial ribosomal RNA (rRNA) species from total RNA NEBNext rRNA Depletion Kit for Human/Mouse/Rat was used (NEB, E6310X).

Strand-specific ribodepletion RNA sequencing:

Enzymatic reactions were set-up in a 96-well plate (Thermo Fisher Scientific) on a Microlab NIMBUS liquid handler (Hamilton Robotics, USA). 100ng of DNase I treated total RNA in 6 µL was hybridized to rRNA probes in a 7.5 µL reaction. Heat-sealed plates were incubated at 95oC for 2 minutes followed by incremental reduction in temperature by 0.1oC per second to 22oC (730 cycles). The rRNA in DNA hybrids were digested using RNase H in a 10 µL reaction incubated in a thermocycler at 37oC for 30 minutes. To remove excess rRNA probes (DNA) and residual genomic DNA contamination, DNase I was added in a total reaction volume of 25 µL and incubated at 37oC for 30 minutes. RNA was purified using RNA MagClean DX beads (Aline Biosciences, USA) with 15 minutes of binding time, 7 minutes clearing on a magnet followed by two 70% ethanol washes, 5 minutes to air dry the RNA pellet and elution in 36uL DEPC water. The plate containing RNA was stored at -80oC prior to cDNA synthesis.

First-strand cDNA was synthesized from the purified RNA (minus rRNA) using the Maxima H Minus First Strand cDNA Synthesis kit (Thermo-Fisher, USA) and random hexamer primers at a concentration of 8ng/µL along with a final concentration of 0.4µg/µL Actinomycin D, followed by PCR Clean DX bead purification on a Microlab NIMBUS robot (Hamilton Robotics, USA). The second strand cDNA was synthesized following the NEBNext Ultra Directional Second Strand cDNA Synthesis protocol (NEB) that incorporates dUTP in the dNTP mix, allowing the second strand to be digested using USERTM enzyme (NEB) in the post-adapter ligation reaction and thus achieving strand specificity.

cDNA was fragmented by Covaris LE220 sonication for 130seconds (2x65seconds) at a “Duty cycle” of 30%, 450 Peak Incident Power (W) and 200 Cycles per Burst in a 96-well microTUBE Plate (P/N: 520078) to achieve 200-250 bp average fragment lengths. The paired-end sequencing library was prepared following the BC Cancer Agency Genome Sciences Centre strand-specific, plate-based library construction protocol on a Microlab NIMBUS robot (Hamilton Robotics, USA). Briefly, the sheared cDNA was subject to end-repair and phosphorylation in a single reaction using an enzyme premix (NEB) containing T4 DNA polymerase, Klenow DNA Polymerase and T4 polynucleotide kinase, incubated at 20oC for 30 minutes. Repaired cDNA was purified in 96-well format using PCR Clean DX beads (Aline Biosciences, USA), and 3’ A-tailed (adenylation) using Klenow fragment (3’ to 5’ exo minus) and incubation at 37oC for 30 minutes prior to enzyme heat inactivation. Illumina PE adapters were ligated at 20oC for 15 minutes. The adapter-ligated products were purified using PCR Clean DX beads, then digested with USERTM enzyme (1 U/µL, NEB) at 37oC for 15 minutes followed immediately by 13 cycles of indexed PCR using Phusion DNA Polymerase (Thermo Fisher Scientific Inc. USA) and Illumina’s PE primer set. PCR parameters: 98˚C for 1 minute followed by 13 cycles of 98˚C 15 seconds, 65˚C 30 seconds and 72˚C 30 seconds, and then 72˚C 5 minutes. The PCR products were purified and size-selected using a 1:1 PCR Clean DX beads-to-sample ratio (twice), and the eluted DNA quality was assessed with Caliper LabChip GX for DNA samples using the High Sensitivity Assay (PerkinElmer, Inc. USA) and quantified using a Quant-iT dsDNA High Sensitivity Assay Kit on a Qubit fluorometer (Invitrogen) prior to library pooling and size-corrected final molar concentration calculation for Illumina HiSeq2500 sequencing with paired-end 75 base reads.

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

RNA-seq libraries were prepared using Illumina TruSeq Stranded Total RNA Sample Preparation kits according to the manufacturer's protocol. Briefly, ribosomal RNA was removed using Ribo-Zero Gold beads. After purification, total RNA was fragmented to 200nt pieces and then reverse-transcribed using reverse transcriptase and random primers. Second strand cDNA was synthesized using DNA polymerase I and RNase H. These cDNA fragments were added as a single base and ligated with adaptors. The products were purified and enriched with PCR to create the final RNA-seq libraries. RNA libraries were sequenced on Illumina HiSeq2000 using 100bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Align reads to reference genome (GRCh37) using Tophat version 2.0.8b with default options, expect for options specifying number of processor threads and fusion search. An example code for alignment with fastq files is shown below.

-o tophat.out –p 6 --fusion-search –fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

Gene and isoform expression:

Gene and isoform expression from RNA-seq data was generated using Cufflinks version 2.1.1. with default options and supplied reference annotation (Homo_sapiens.GRCh37.71.gtf) for estimation of expression. Cufflinks will not assemble novel transcripts, and it will ignore alignments not structurally compatible with any reference transcript.

Exon expression:

RPKM for a given ExonX is determined by:  ( (raw base counts / median read length) * 10^9) / (total reads * exon length). The raw base counts for a given ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many   cases only a portion of a read will align to a given exon. 

Gene fusion:

Gene fusion file was generated using defuse version 0.6.1 with default parameters and with reference annotation Homo_sapiens.GRCh37.69.  

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

RNA-seq libraries were prepared using Illumina TruSeq Stranded Total RNA Sample Preparation kits according to the manufacturer's protocol. Briefly, ribosomal RNA was removed using Ribo-Zero Gold beads. After purification, total RNA was fragmented to 200nt pieces and then reverse-transcribed using reverse transcriptase and random primers. Second strand cDNA was synthesized using DNA polymerase I and RNase H. These cDNA fragments were added as a single base and ligated with adaptors. The products were purified and enriched with PCR to create the final RNA-seq libraries. RNA libraries were sequenced on Illumina HiSeq2000 using 100bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Align reads to reference genome (GRCh37) using Tophat version 2.0.8b with default options, expect for options specifying number of processor threads and fusion search. An example code for alignment with fastq files is shown below.

-o tophat.out –p 6 --fusion-search –fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

Gene Coverage Analysis Protocol: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/  

Data Level: 3 Data File: *.gene.quantification.txt

The gene coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene )

[Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)]

If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

*.gene.quantification.txt: A tab-delimited text file containing the following fields: - gene = Gene ID from GAF (version 3.0). The ID follows the nomenclature '<HUGO gene symbol>|<Entrez ID>'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that the HUGO gene symbol or Entrez ID is not available. e.g. U80769|?; TRNA_Pseudo|?|8of100 - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation. - median_length_normalized = Average coverage over all exons in the collapsed transcripts i.e. sum of the coverage depth at each base in all exons divided by the sum of the exon lengths - RPKM = Reads per kilobase of exon per million. Calculation described in detail below.

Exon Coverage Analysis:

Data Level: 3 Data File: **.exon.quantification.txt

The exon coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similar to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

**.exon.quantification.txt A tab-delimited text file containing the following fields: - exon = Exon coordinates according to GAF (version 3.0) with the nomenclature, chr<chromosome number>:<start coordinate>-<end coordinate>:<strand>. '.' in the <strand> indicates that there was no strand information available. e.g. chr10:120810487-120810613:. - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10 were excluded from calculation. - median_length_normalized = Average coverage over the exon i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon. - RPKM = Reads per kilobase of exon per million.                 

Gene and isoform expression:

Gene and isoform expression from RNA-seq data was generated using Cufflinks version 2.1.1. with default options and supplied reference annotation (Homo_sapiens.GRCh37.71.gtf) for estimation of expression. Cufflinks will not assemble novel transcripts, and it will ignore alignments not structurally compatible with any reference transcript.

Exon expression:

Exon expression file was generated using dexseq_count.py included in R package DEXseq 1.12.1 with annotation (Homo_sapiens.GRCh37.71.gff) and default parameters except for –p yes (indicates the data is paired end) and –s no (indicates the data is not from a strand-specific assay).

Gene fusion:

Gene fusion file was generated using defuse version 0.6.1 with default parameters and with reference annotation Homo_sapiens.GRCh37.69.  

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

PolyA+ RNA was purified using the MACS mRNA isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany), from 5-10ug of DNaseI-treated total RNA as per the manufacturer’s instructions. Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) and random hexamer primers (Invitrogen) at a concentration of 5µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina, Hayward, CA, USA). RNA samples were prepared by Illumina TruSeqRNA Sample Preparation V2 kits according to the manufacturer's protocol. Poly-A containing mRNA was purified using poly-T oligo-attached magnetic beads and then fragmented. RNA fragments of ~200bp were reverse-transcribed and ligated with adaptors for sequencing. RNA libraries were sequenced on Illumina HiSeq2000 using 100bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Align reads to reference genome (GRCh37) using Tophat version 2.0.8b with default options, expect for options specifying number of processor threads and fusion search. An example code for alignment with fastq files is shown below.

-o tophat.out –p 6 --fusion-search –fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

Gene and isoform expression:

Gene and isoform expression from RNA-seq data was generated using Cufflinks version 2.1.1. with default options and supplied reference annotation (Homo_sapiens.GRCh37.71.gtf) for estimation of expression. Cufflinks will not assemble novel transcripts, and it will ignore alignments not structurally compatible with any reference transcript.

Exon expression:

RPKM for a given ExonX is determined by:  ( (raw base counts / median read length) * 10^9) / (total reads * exon length). The raw base counts for a given ExonX is the total number of bases aligned to that genomic segment. Raw base counts are used instead of raw read counts because in many   cases only a portion of a read will align to a given exon. 

Gene fusion:

Gene fusion file was generated using defuse version 0.6.1 with default parameters and with reference annotation Homo_sapiens.GRCh37.69.  

*Protocols were performed through the laboratory of Dr. Javed Khan.

RNA-seq Library construction and sequencing by Illumina HiSeq2000:

PolyA+ RNA was purified using the MACS mRNA isolation kit (Miltenyi Biotec, Bergisch Gladbach, Germany), from 5-10ug of DNaseI-treated total RNA as per the manufacturer’s instructions. Double-stranded cDNA was synthesized from the purified polyA+ RNA using the Superscript Double-Stranded cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) and random hexamer primers (Invitrogen) at a concentration of 5µM. The cDNA was fragmented by sonication and a paired-end sequencing library prepared following the Illumina paired-end library preparation protocol (Illumina, Hayward, CA, USA). RNA samples were prepared by Illumina TruSeqRNA Sample Preparation V2 kits according to the manufacturer's protocol. Poly-A containing mRNA was purified using poly-T oligo-attached magnetic beads and then fragmented. RNA fragments of ~200bp were reverse-transcribed and ligated with adaptors for sequencing. RNA libraries were sequenced on Illumina HiSeq2000 using 100bp paired-end sequencing according to the manufacturer's protocol.

*Protocols were performed in the laboratory of Dr. Javed Khan.

Reads Alignment:

Align reads to reference genome (GRCh37) using Tophat version 2.0.8b with default options, expect for options specifying number of processor threads and fusion search. An example code for alignment with fastq files is shown below.

-o tophat.out –p 6 --fusion-search –fusion-min-dist 100000 GRCh37 read_1.fq read_2.fq

RNA sequencing reads were aligned to GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly and exon-exon junction sequences whose corresponding coordinates were defined based on annotations of any transcripts in Ensembl (v59), Refseq and known genes from the UCSC genome browser, which was downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome, and were marked with 'ZJ:Z' tags. BWA is run using default parameters, except that the option (-s) is included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter are flagged with a custom script, and duplicated reads were flagged with Picard Tools.

Gene Coverage Analysis Protocol: www.bcgsc.ca/downloads/genomes/Homo_sapiens/hg19/1000genomes/bwa_ind/genome/  

Data Level: 3 Data File: *.gene.quantification.txt

The gene coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Each composite gene annotation was generated by collapsing all transcripts of that gene into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. Thus, the locations of the exonic boundaries used for the gene coverage analysis were not based on a single canonical transcript for each gene. Consequently, the exonic boundaries in a composite gene model may not correspond to the actual boundaries of the expressed transcripts. For simplicity, throughout this document and in the gene coverage results files, a composite gene model is simply referred to as a gene, and it is associated with the id of the gene whose transcripts contributed to that composite model. To generate the raw read counts, we first counted the number of bases of each read that were inside exonic regions in a gene, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. A gene's raw read count is the sum of raw read counts for exons belonging to the gene. Gene coverage is its raw read count divided by the sum of its exon lengths. RPKM is calculated using the formula: (number of reads mapped to all exons in a gene x 1,000,000,000)/(NORM_TOTAL x sum of the lengths of all exons in the gene )

[Note: NORM_TOTAL = the total number of reads that are mapped to all exons from the composite gene models. (i.e. sum of the fractional read count for all exons)]

If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

*.gene.quantification.txt: A tab-delimited text file containing the following fields: - gene = Gene ID from GAF (version 3.0). The ID follows the nomenclature '<HUGO gene symbol>|<Entrez ID>'. If the combination of the HUGO symbol and the Entrez ID is not unique, an additional 'NofM' descriptor is added. An ID with '?' indicates that the HUGO gene symbol or Entrez ID is not available. e.g. U80769|?; TRNA_Pseudo|?|8of100 - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to collapsed transcripts representing a specific gene. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10, i.e. reads that did not map uniquely, were excluded from calculation. - median_length_normalized = Average coverage over all exons in the collapsed transcripts i.e. sum of the coverage depth at each base in all exons divided by the sum of the exon lengths - RPKM = Reads per kilobase of exon per million. Calculation described in detail below.

Exon Coverage Analysis:

Data Level: 3 Data File: **.exon.quantification.txt

The exon coverage analysis was performed with our internal analysis pipeline version 1.1 using "composite" gene annotations from the hg19 (GRCh37-lite) version of the TCGA GAF v3.0. These composite gene models were created in June 2011 by UNC (with assistance from UCSC) based on the annotations in the "UCSC genes" database. Similar to the gene coverage analysis, all transcripts of a given gene were collapsed into a single model such that exonic bases in a composite gene model were the union of exonic bases from all known transcripts of the gene. For simplicity, throughout this document and in the exon coverage results files, the collapsed exons are simply referred to as an exon. To generate the raw read counts, we first counted the number of bases of each read that were inside an exonic region, and then divided this total base count by the read length. Thus our values for the raw number of reads were not whole numbers (i.e. if the entire 50bp read mapped to an exon, we would add 50 to the total base count, which would ultimately contribute 1 to the raw read count. However, if only 25 bases of the read's alignment fell within an exon's boundaries, the total base count would be incremented by 25, which would ultimately contribute 0.5 to the raw read count). In order to comply with the file format specification enforced by the DCC validator, our raw read counts are rounded to the closest whole number. Exon coverage is the raw read count of an exon divided by its length. RPKM is calculated using the formula (number of reads (fractional) mapped to an exon x 1,000,000,000)/(NORM_TOTAL x length of an exon) [Note: NORM_TOTAL = the total number of reads (fractional) that mapped to exons, excluding those in the mitochondrial chromosome] If a read alignment contained a deletion or a large gap, the read did not contribute coverage inside the region spanned by the deletion/gap. Each of the paired end reads was counted separately. We excluded reads from pairs that failed Illumina's Chastity filter, as well as reads with mapping quality < 10.

**.exon.quantification.txt A tab-delimited text file containing the following fields: - exon = Exon coordinates according to GAF (version 3.0) with the nomenclature, chr<chromosome number>:<start coordinate>-<end coordinate>:<strand>. '.' in the <strand> indicates that there was no strand information available. e.g. chr10:120810487-120810613:. - raw_counts = Sum of fraction of reads (rounded off to nearest integer - restricted by the RNA-seq validator) that mapped to an exon. Reads from pairs that did not pass Illumina’s Chastity filter or with mapping quality less than 10 were excluded from calculation. - median_length_normalized = Average coverage over the exon i.e. the sum of the coverage depth at each base in an exon divided by the length of the exon. - RPKM = Reads per kilobase of exon per million.                 

Gene and isoform expression:

Gene and isoform expression from RNA-seq data was generated using Cufflinks version 2.1.1. with default options and supplied reference annotation (Homo_sapiens.GRCh37.71.gtf) for estimation of expression. Cufflinks will not assemble novel transcripts, and it will ignore alignments not structurally compatible with any reference transcript.

Exon expression:

Exon expression file was generated using dexseq_count.py included in R package DEXseq 1.12.1 with annotation (Homo_sapiens.GRCh37.71.gff) and default parameters except for –p yes (indicates the data is paired end) and –s no (indicates the data is not from a strand-specific assay).

Gene fusion:

Gene fusion file was generated using defuse version 0.6.1 with default parameters and with reference annotation Homo_sapiens.GRCh37.69.  

St Jude’s RNA-seq Protocol for ALAL:

At SJCRH, total RNA quality and quantity were assessed on Agilent RNA6000 chips (Agilent Technologies) and Qubit (Life Technologies). RNA-seq libraries were prepared from 500 ng of total RNA for each sample following Illumina RNA-seq protocols, including DNase treatment and phenol purification, cDNA conversion, fragmentation by Covaris Ultrasonicator, end repair, deoxyadenosine tailing, adaptor ligation and PCR amplification (ten cycles). Libraries with a 10 pM concentration were clustered on an Illumina cBot, and each flow cell was loaded onto a HiSeq instrument for sequencing using the Illumina 2×100 bp sequencing kit.

Last updated: May 30, 2019