Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study

Detailed methods were described in our two companion papers^2,3.

Cell line culture and DNA extraction

HCC1395; Breast Carcinoma; Human (Homo sapiens) cells (expanded from ATCC CRL-2324) were cultured in ATCC-formulated RPMI-1640 Medium, (ATCC 30–2001) supplemented with fetal bovine serum (ATCC 30–2020) to a final concentration of 10%. Cells were maintained at 37 °C with 5% carbon dioxide (CO₂) and were sub-cultured every 2 to 3 days, per ATCC recommended procedures using 0.25% (w/v) Trypsin-0.53 mM EDTA solution (ATCC 30–2101), until appropriate densities were reached. HCC1395BL; B lymphoblast; Epstein-Barr virus (EBV) transformed; Human (Homo sapiens) cells (expanded from ATCC CRL-2325) were cultured in ATCC-formulated Iscove’s Modified Dulbecco’s Medium, (ATCC Catalog No. 30–2005) supplemented with fetal bovine serum (ATCC 30–2020) to a final concentration of 20%. Cells were maintained at 37 °C with 5% CO₂ and were sub-cultured every 2 to 3 days, per ATCC recommended procedures, using centrifugation with subsequent resuspension in fresh medium until appropriate densities were reached. Final cell suspensions were spun down and re-suspended in PBS for nucleic acid extraction.

All cellular genomic material was extracted using a modified Phenol- Chloroform-Iso-Amyl alcohol extraction approach. Essentially, cell pellets were re-suspended in TE, subjected to lysis in a 2% TritonX-100/0.1% SDS/0.1 M NaCl/10 mM Tris/1 mM EDTA solution and were extracted with a mixture of glass beads and Phenol- Chloroform-Iso-Amyl alcohol. Following multiple rounds of extraction, the aqueous layer was further treated with Chloroform-IAA and finally underwent RNases treatment and DNA precipitation using sodium acetate (3 M, pH 5.2) and ice-cold Ethanol. The final DNA preparation was re-suspended in TE and stored at −80 °C until use.

FFPE processing and DNA extraction

Please see Online methods in our companion paper² for details.

Illumina WGS library preparation

The TruSeq DNA PCR-Free LT Kit (Illumina, FC-121-3001) was used to prepare samples for whole genome sequencing. WGS libraries were prepared at six sites with the TruSeq DNA PCR-Free LT Kit according to the manufacturers’ protocol. The input DNA amount for WGS library preparation with fresh DNA for TruSeq-PCR-free libraries was 1 ug unless otherwise specified. All sites used the same fragmentation conditions for WGS by using Covaris with targeted size of 350 bp. All replicated WGS were prepared on a different day.

The concentration of the TruSeq DNA PCR-Free libraries for WGS was measured by qPCR with the KAPA Library Quantification Complete Kit (Universal) (Roche, KK4824). The concentration of all the other libraries was measured by fluorometry either on the Qubit 1.0 fluorometer or on the GloMax Luminometer with the Quant-iT dsDNA HS Assay kit (ThermoFisher Scientific, Q32854). The quality of all libraries was assessed by capillary electrophoresis either on the 2100 Bioanalyzer or TapeStation instrument (Agilent) in combination with the High Sensitivity DNA Kit (Agilent, 5067-4626) or the DNA 1000 Kit (Agilent, 5067-1504) or on the 4200 TapeStation instrument (Agilent) with the D1000 assay (Agilent, 5067–5582 and 5067–5583).

For the WGS library preparation from cross-site study, the sequencing was performed at six sequencing sites using three different Illumina platforms including HiSeq 4000 instrument at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (cat# FC-410-1003), and on a NovaSeq instrument at 2 × 150 bases read length using the S2 configuration (cat#PN 20012860), or on a HiSeq X Ten at 2 × 150 bases read length using the X10 SBS chemistry (cat# FC-501-2501). Sequencing was performed following the manufacturer’s instructions.

For the comparison study of WGS library protocol using different input DNA amounts, Illumina TruSeq DNA PCR-free protocol used 250 ng input DNA, Illumina TruSeq Nano protocol libraries were prepared with 1 ng, 10 ng, and 100 ng input DNA amounts. Illumina Nextera Flex libraries were prepared with 1 ng, 10 ng, and 100 ng input DNA amounts. These libraries sequenced at two sequencing sites using two different Illumina platforms including HiSeq 4000 instrument (Illumina) at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003) and NovaSeq instrument (Illumina) at 2 × 150 bases read length using the S2 configuration (Illumina, PN 20012860). Sequencing was performed following the manufacturer’s instructions.

For the tumor purity study, 1 µg tumor:normal dilutions were made in the following ratios using Resuspension Buffer (Illumina): 1:0, 3:1, 1:1, 1:4, 1:9, 1:19 and 0:1. Each ratio was diluted in triplicate. DNA was sheared using the Covaris S220 to target a 350 bp fragment size (Peak power 140w, Duty Factor 10%, 200 Cycles/Bursts, 55 s, Temp 4 °C). NGS library preparation was performed using the Truseq DNA PCR-free protocol (Illumina) following the manufacturer’s recommendations. The sample purity WGS libraries were sequenced on a HiSeq 4000 instrument (Illumina) at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003). Sequencing was performed following the manufacturer’s instructions.

Whole exome library construction and sequencing

SureSelect Target Enrichment Reagent kit, PTN (Part No G9605A), SureSelect Human All Exon v6 + UTRs (Part No 5190–8881), Herculase II Fusion DNA Polymerase (Part No 600677) from Agilent Technologies and Ion Xpress Plus Fragment kit (Part No 4471269, Thermo Fischer Scientific Inc) were combined to prepare library according to the manufacturer’s guidelines (User guide: SureSelect Target Enrichment System for Sequencing on Ion Proton, Version C0, December 2016, Agilent Technologies). Prior, during and after library preparation the quality and quantity of genomic DNA (gDNA) and/or libraries were evaluated applying QubitTM fluorometer 2.0 with dsDNA HS Assay Kit (Thermo Fischer Scientific Inc) and Agilent Bioanalyzer 2100 with High Sensitivity DNA Kit (Agilent Technologies).

WES libraries were sequenced at six sequencing sites with two different Illumina platforms, Hiseq 4000 instrument (Illumina) at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003) and Hiseq 2500 (Illumina) at 2 × 100 bases read length with HiSeq 2500 chemistry (Illumina, FC-401-4003). Sequencing was performed following the manufacturer’s instructions.

Whole genome FFPE sample library preparation and sequencing

For the FFPE WGS study, NEBNext Ultra II (NEB) libraries were prepared according to the manufacturer’s instructions. However, input adjustments were made according to the dCq obtained for each sample using the TruSeq FFPE DNA Library Prep QC Kit (Illumina) to account for differences in sample amplifiability. A total of 33 ng of amplifiable DNA was used as input for each sample.

FFPE WGS libraries were sequenced on two different sequencing canters on Hiseq 4000 instrument (Illumina) at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003). Sequencing was performed following the manufacturer’s instructions.

Whole exome FFPE sample library preparation and sequencing

For the FFPE study, SureSelect (Agilent) WES libraries were prepared according to the manufacturer’s instructions for 200 ng of DNA input, including reducing the shearing time to four minutes. Additionally, the adaptor-ligated libraries were split in half prior to amplification. One half was amplified for 10 cycles and the other half for 11 cycles to ensure adequate yields for probe hybridization. Both halves were combined after PCR for the subsequent purification step.

FFPE WES libraries were sequenced on at two sequencing sites with different Illumina platforms, Hiseq 4000 instrument (Illumina) at 2 × 150 bases read length with HiSeq 3000/4000 SBS chemistry (Illumina, FC-410-1003) and Hiseq 2500 (Illumina) at 2 × 100 bases read length with HiSeq 2500 chemistry (Illumina, FC-401-4003). Sequencing was performed following the manufacturer’s instructions.

PacBio library preparation and sequencing

15 ug of material was sheared to 40 kbp with Megarupter (Diagenode). Per the Megarupter protocol the samples were diluted to <50 ng/ul. A 1x AMPure XP bead cleanup was performed. Samples were prepared as outlined on the PacBio protocol titled “Preparing >30 kbp SMRTbell Libraries Using Megarupter Shearing and Blue Pippin Size-Selection for PacBio RS II and Sequel Systems.” After library preparation, the library was run overnight for size selection using the Blue Pippin (Sage). The Blue Pippin was set to select a size range of 15–50 kbp. After collection of the desired fraction, a 1x AMPure XP bead cleanup was performed. The samples were loaded on the PacBio Sequel (Pacific Biosciences) following the protocol titled “Protocol for loading the Sequel.” The recipe for loading the instrument was generated by the Pacbio SMRTlink software v5.0.0. Libraries were prepared using Sequel chemistry kits v2.1, SMRTbell template kit 1.0 SPv3, magbead v2 kit for magbead loading, sequencing primer v3, and SMRTbell clean-up columns v2. Libraries were loaded at between 4 pM and 8 pM.Sequencing was performed following the manufacturer’s instructions.

10X Genomics Chromium genome library preparation and sequencing

Sequencing libraries were prepared from 1.25 ng DNA using the Chromium Genome Library preparation v2 kit (10X Genomics, cat #120257/58/61/62) according to the manufacturer’s protocol (#CG00043 Chromium Genome Reagent Kit v2 User Guide). The quality of the libraries was evaluated using the TapeStation D1000 Screen Tape (Agilent). The adapter-ligated fragments were quantified by qPCR using the library quantification kit for Illumina (KK4824, KAPA Biosystems) on a CFX384Touch instrument (BioRad) prior to cluster generation and sequencing. Chromium libraries were sequenced on a HiSeq X Ten or a HiSeq 4000 instrument at 2 × 150 base pair (bp) read length and using sequencing chemistry v2.5 or HiSeq 3000/4000 SBS chemistry (Illumina, cat# FC-410-1003) across five sequencing sites.

Sequencing was performed following the manufacturer’s instructions.

AmpliSeq library construction and sequencing

AmpliSeq libraries were prepared in triplicate and prepared as specified in the Illumina protocol (Document # 1000000036408 v04) following the two oligo pools workflow with 10 ng of input genomic DNA per pool. The number of amplicons per pool was 1517 and 1506 respectively. The libraries were quality-checked using an Agilent Tapestation 4200 with the DNA HS 1000 kit and quantitated using a Qubit 3.0 and DNA high sensitivity assay kit. The libraries were applied to a MiSeq v2.0 flowcell. They were then amplified and sequenced with a MiSeq 300 cycle reagent cartridge with a read length of 2 × 150 base pair (bp). The MiSeq run produced 7.3 Gbp (94.5%) at ≥Q30. The total number of reads passing filter was 47,126,128 reads.

Whole exome library Ion platform sequencing

SureSelect Target Enrichment Reagent kit, PTN (Part No G9605A), SureSelect Human All Exon v6 + UTRs (Part No 5190–8881), Herculase II Fusion DNA Polymerase (Part No 600677) from Agilent Technologies and Ion Xpress Plus Fragment kit (Part No 4471269, Thermo Fisher Scientific Inc) were combined to prepare libraries according to the manufacturer’s guidelines (User guide: SureSelect Target Enrichment System for Sequencing on Ion Proton, Version C0, December 2016, Agilent Technologies). Prior, during, and after library preparation the quality and quantity of genomic DNA (gDNA) and/or libraries were evaluated applying QubitTM fluorometer 2.0 with dsDNA HS Assay Kit (Thermo Fisher Scientific Inc) and Agilent Bioanalyzer 2100 with High Sensitivity DNA Kit (Agilent Technologies).

For sequencing the WES libraries, the Ion S5 XL Sequencing platform with Ion 540-Chef kit (Part No A30011, Thermo Fisher Scientific Inc) and the Ion 540 Chip kit (Part No A27766, Thermo Fisher Scientific Inc) were used. One sample per 540 chip was sequenced, generating up to 60 million reads with average length of 200 bp.

10X Genomics Single Cell CNV library construction, sequencing and analysis

HCC1395 and HCC1395 BL were cultured as described above. 500,000 cells of each culture were suspended in 1 mL suspension medium (10% DMSO in cell culture medium). Cells were harvested the next day for single-cell copy number variation (CNV) analysis via the 10X Genomics Chromium Single Cell CNV Solution (Protocol document CG000153) produces Single Cell DNA libraries ready for Illumina sequencing according to manufacturer’s recommendations. Libraries were sequenced on a HiSeq 4000 instrument at 2 × 150 base pair (bp) read length and using sequencing chemistry v2.5 or HiSeq 3000/4000 SBS chemistry (Illumina, cat# FC-410-1003). Demultiplex BCL from sequencing run and Copy Number Variation analysis were performed using 10X Genomics Cell Ranger DNA version 1.1 software. CNV and heterogeneity visualization analysis was performed via 10X Genomics Loupe scDNA browser.

Affymetrix Cytoscan HD microarray

DNA concentration was measured spectrophotometrically using a Nanodrop (Life technology), and integrity was evaluated with a TapeStation 4200 (Agilent). Two hundred and fifty nanograms of gDNA were used to proceed with the Affymetrix CytoScan Assay kit (Affymetrix). The workflow consisted of restriction enzyme digestion with Nsp I, ligation, PCR, purification, fragmentation, and end labeling. DNA was then hybridized for 16 hr at 50 °C on a CytoScan array (Affymetrix), washed and stained in the Affymetrix Fluidics Station 450 (Affymetrix), and then scanned with the Affymetrix GeneChip Scanner 3000 G7 (Affymetrix). Data were processed with ChAS software (version 3.3). Array-specific annotation (NetAffx annotation release 36, built with human hg38 annotation) was used in the analysis workflow module of ChAS. Karyoview plot and segments data were generated with default parameters.

Reference genome

The reference genome we used was the decoy version of the GRCh38/hg38 human reference genome (https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files; GRCh38.d1.dv1.fa), which was utilized by the Genomic Data Commons (GDC).

The gene annotation GTF file was downloaded from the 10X website as refdata-cellranger-GRCh38-1.2.0.tar.gz, which corresponds to the GRCh38 genome and Ensmebl v84 transcriptome.

All the following bioinformatics data analyses are based on the above reference genome and gene annotation.

Preprocessing and alignment of WGS Illumina data

For each of the paired-end read files (i.e., FASTQ 1 and 2 files) generated by Illumina sequencers (HiSeq, NovaSeq, X Ten platforms), we first trimmed low-quality bases and adapter sequences using Trimmomatic⁴. The trimmed reads were mapped to the human reference genome GRCh38 (see the read alignment section) using BWA MEM (v0.7.17)⁵ in paired-end mode and bwa-mem was run with the –M flag for downstream Picard⁶ compatibility.

Post alignment QC was performed based both FASTQ on BWA alignment BAM files, the read quality and adapter content were reported by FASTQC⁷ software. The genome mapped percentages and mapped reads duplication rates calculated by BamTools (v2.2.3) and Picard (v1.84). The genome coverage and exome target region coverages as well as mapped reads insert sizes, and G/C contents were profiled using Qualimap(v2.2)⁸ and custom scripts. Preprocessing QC reports were generated during each step of the process. MultiQC(v1.9)⁹ was run to generate an aggregated report in html format. A standard QC metrics report was generated from a custom script. The preprocessing and alignment QC analysis pipeline is described in Suppl. Figure 1a.

Preprocessing and alignment of WES Illumina data

For each of the paired-end read files generated by Illumina sequencers (HiSeq 2500, HiSeq 4000 platforms), we first trimmed low-quality bases and adapter sequences using Trimmomatic. The trimmed reads were mapped to the human reference genome GRCh38 (see the read alignment section) using BWA MEM (v0.7.17) in paired-end mode. We calculated on-target rate based on the percentage of mapped reads that were overlap the target capture bait region file (target.bed). The post alignment QC methods are same as WGS Illumina data pre-processing.

DNA damage estimate for WGS, WES and FFPE samples

The DNA Damage Estimator(v3)¹⁰ was used to calculate the GIV score based on an imbalance between R1 and R2 variant frequency of the sequencing reads to estimate the level of DNA damage that was introduced in the sample/library preparation processes. GIV score above 1.5 is defined as damaged. At this GIV score, there are 1.5 times more variants on R1 than on R2. Undamaged DNA samples have a GIV score of 1.

Preprocessing and alignment of PacBio data

PacBio raw data were merged bam files using SMRTlink tool v6.0.1. which used minimap2¹¹ as default aligner. The non-human reads were removed and minimap BAM files were used for downstream analysis. Duplicate reads were mark and removed from PBSV alignment bases on the reads coming from the same ZMW, the base pair tolerance was set to 100 bp to remove the duplicated reads. The preprocessing and alignment QC analysis pipeline for PacBio data is described in Suppl. Figure 1b.

Genome coverage profiling

We used indexcov¹² to estimate coverage from the Illumina whole genome sequencing library cross-site comparison data set. The bam file for each library used as input to indexcov to generate a linear index for each chromosome indicating the file (and virtual) offset for every 16,384 bases in that chromosome. This gives the scaled value for each 16,384-base chunk (16KB resolution) and provides a high-quality coverage estimate per genome. The output is scaled to around 1. A long stretch with values of 1.5 would be a heterozygous duplication; a long stretch with values of 0.5 would be a heterozygous deletion.

Preprocessing and alignment of 10X Genomics WGS data

The 10X Genomics Chromium fastq files were mapped and reads were phased using LongRanger to the hg38/GRCh38 reference genome using the LongRanger v2.2.2 pipeline [https://genome.cshlp.org/content/29/4/635.full]. The linked-reads were aligned using the Lariat aligner¹³, which uses BWA MEM to generate alignment candidates, and duplicate reads are marked after alignment. Linked-Read data quality was assessed using the 10X Genome browser Loupe. MultiQC(v1.9) was run to generate an aggregated report in html format. A standard QC metrics report was generated from a custom script. The preprocessing and alignment QC analysis pipeline is described in Suppl. Figure 1a.

Preprocessing and alignment of Ion Torrent data

Raw reads were first filtered for low-quality reads and trimmed to remove adapter sequences and low-quality bases. This step was performed using the BaseCaller module of the Torrent SuitTM software package v5.8.0 (Thermo Fischer Scientific Inc). Low-quality reads were retained from further analysis in the raw signal processing stage. Low-quality bases were trimmed from the 5′ end if the average quality score of the 16-base window fell below 16 (Phred scale), cleaving 8 bases at once. Processed reads were mapped to the GRCh38 reference genome by TMAP module of the Torrent Suite software package using the default map4 algorithm with recommended settings. Picard (v1.84) was then used to mark PCR and optical duplicates on the BAM files.

Preprocessing and alignment for AmpliSeq

Low-quality bases and adapter sequences were trimmed with Trimmomatic. The trimmed reads were mapped to the human reference genome GRCh38 (see the read alignment section) using BWA MEM (v0.7.17) in paired-end mode. We calculated on-target rate based on the percentage of mapped reads that were overlap the target capture bait region file (target.bed). We counted the number of variant-supporting reads and total reads for each variant position with MQ ≥ 40 and BQ ≥ 30 cutoffs. The preprocessing and alignment QC analysis pipeline is described in Suppl. Figure 1a.

Somatic variant analysis

Four somatic variant callers, MuTect2 (GATK 3.8-0)¹⁴, SomaticSniper (1.0.5.0)¹⁵, Strelka2 (2.8.4)¹⁶, and Lancet (1.0.7)¹⁷, which are readily available on the NIH Biowulf cluster, were run using the default parameters or parameters recommended by the user’s manual. Specifically, for MuTect2, we included flags for “-nct 1 -rf DuplicateRead -rf FailsVendorQualityCheck -rf NotPrimaryAlignment -rf BadMate -rf MappingQualityUnavailable -rf UnmappedRead -rf BadCigar”, to avoid the running exception for “Somehow the requested coordinate is not covered by the read”. For MuTect2, we used COSMIC v82 as required inputs. For SomaticSniper, we added a flag for “-Q 40 -G -L –F”, as suggested by its original author, to ensure quality scores and reduce likely false positives. For TNscope (201711.03), we used the version implemented in Seven Bridges’s CGC with the following command, “sentieon driver -i $tumor_bam -i $normal_bam -r $ref–algo TNscope–tumor_sample $tumor_sample_name–normal_sample $normal_sample_name -d $dbsnp $output_vcf”. For Lancet, we ran with 24 threads on the following parameters “–num-threads 24–cov-thr 10–cov-ratio 0.005–max-indel-len 50 -e 0.005”. Strelka2 was run with 24 threads with the default configuration. The rest of the software analyzed was run as a single thread on each computer node. All mutation calling on WES data was performed with the specified genome region in a BED file for exome-capture target sequences.

The high confidence outputs or SNVs flagged as “PASS” in the resulting VCF files were applied to our comparison analysis. Results from each caller used for comparison were all mutation candidates that users would otherwise consider as “real” mutations detected by this caller.

GATK indel realignment and quality score recalibration

The GATK (3.8-0)-IndelRealigner was used to perform indel adjustment with reference indels defined in the 1000Genome project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz). The resulting BAM files were then recalibrated for quality with “BaseRecalibrator” and dbSNP build 146 as the SNP reference. Finally,”PrintReads” was used to generate recalibrated BAM files.

Tumor ploidy and clonality analysis from whole genome and exome data

To estimate the HCC1395 cell line ploidy, we used PURPLE¹⁸ to determine the purity and copy number profile. To determine the clonality of HCC1395 and HCC1395 BL, we performed somatic SNV and CAN analysis using superFreq¹⁹. on capture WES datasets. Mapped and markDuplicate bam files of a pair of HCC1395 and HCC1395BL were used as input and bam files of the remaining replicates of the HCC1395BL library were used to filter background. Analysis was run using the superFreq default parameters. The clonality of each somatic SNV was calculated based on the VAF, accounting for local copy number. The SNVs and CNAs undergo hierarchical clustering based on the clonality and uncertainty across replicates for the tumor sample.

Assessment of reproducibility and O_Score calculation

we established following formula to measure reproducibility based on the overlapping SNVs:

$${O}_{score}=frac{{sum }_{i=1}^{ito n}left(left(frac{i}{n}right)times {O}_{i}right)}{{sum }_{i=1}^{ito n}{O}_{i}}$$

where n is the total number of VCF results in the pool set, i is the number of overlaps, O_i is the number of accumulated SNVs in the set with i number of overlapping.

Source link

Vasiprak Blog