Universal Human Papillomavirus Typing Assay: Whole-Genome Sequencing following Target Enrichment

ABSTRACT We designed a universal human papillomavirus (HPV) typing assay based on target enrichment and whole-genome sequencing (eWGS). The RNA bait included 23,941 probes targeting 191 HPV types and 12 probes targeting beta-globin as a control. We used the Agilent SureSelect XT2 protocol for library preparation, Illumina HiSeq 2500 for sequencing, and CLC Genomics Workbench for sequence analysis. Mapping stringency for type assignment was determined based on 8 (6 HPV-positive and 2 HPV-negative) control samples. Using the optimal mapping conditions, types were assigned to 24 blinded samples. eWGS results were 100% concordant with Linear Array (LA) genotyping results for 9 plasmid samples and fully or partially concordant for 9 of the 15 cervical-vaginal samples, with 95.83% overall type-specific concordance for LA genotyping. eWGS identified 7 HPV types not included in the LA genotyping. Since this method does not involve degenerate primers targeting HPV genomic regions, PCR bias in genotype detection is minimized. With further refinements aimed at reducing cost and increasing throughput, this first application of eWGS for universal HPV typing could be a useful method to elucidate HPV epidemiology.

H uman papillomaviruses (HPV) are double-stranded DNA viruses in the family Papillomaviridae. More than 200 genotypes are recognized based on the sequence of the approximately 8-kbp circular genome, with variants in each genotype based on sequence relatedness (1,2). Detection and typing of HPV has clinical and public health significance because of the association of some genotypes in the Alphapapillomavirus genus with anogenital and oropharyngeal cancers. As a result, most assays used in HPV epidemiology and natural history studies are directed to these genotypes, and many PCR-based assays use degenerate primers directed to the L1 region of the HPV genome (3,4). Studies examining disease associations of an increasingly broad spectrum of HPV genotypes have been hampered by the need for multiple assays to detect genotypes in alpha, beta, and gamma genera (5,6).
Next-generation sequencing (NGS)-based methods have the promise to improve the spectrum of HPV genotypes detected. Some NGS methods used PCR with consensus primers targeting the L1 region for library preparation, restricting regions of the genome that could be analyzed (7)(8)(9)(10)(11)(12)(13). HPV genome sequencing to single or limited HPV genotypes was achieved using the standard protocol for library preparation without any target enrichment (14) or after whole-genome amplification by rollingcircle amplification (15) or highly multiplexed degenerate primers (16). There are a few reports of NGS using target enrichment to study the mechanistic signatures of inte-gration of a limited number of HPV genotypes in cervical carcinomas (17)(18)(19). The goal of this study was to capitalize on the availability of new target enrichment technology to facilitate whole-genome sequencing to detect, genotype, and potentially characterize variant and integration statuses of all known HPVs belonging to alpha, beta, and gamma genera using a single assay. Target enrichment technology uses hybridization to purify genomic fragments of interest with DNA or RNA baits. DNA/RNA baits were initially used and proved to be highly effective to enrich human exomes for variant calling by deep sequencing (20,21). We developed custom RNA baits specific to all known HPV genomes to enrich the sample for sequencing of whole viral genomes. Here, we describe and provide an initial evaluation of this whole-genome sequencebased approach for broad-spectrum HPV genotype determination in samples with single and multiple HPV infection using the Agilent SureSelect target enrichment technology.

RESULTS
Assessment of sequence data quality. DNA sheared to approximately 150 bp is expected to increase in size to around 300 bp in the indexed library after ligation of adaptors that enable limited PCR amplification and sequencing of the library. As expected, bioanalyzer analysis of the indexed pooled libraries prepared for sequencing indicated the size distribution with a peak around 300 bp. The seed concentration of 5.3 pM generated cluster densities of 1,173,000 and 1,183,000/mm 2 in pool 1 (samples 1 to 16) and 2 (samples 17 to 32), respectively. All raw reads from both pools passed the default filtering of the Illumina BCl2fastq V1.8.4 software (pool 1, 268,096,524; pool 2, 226,336,668). The average number of reads per sample in pool 1 was 8,538,564, and that from Caski at 100 ng (sample 11) with ϳ500 HPV16 copies/cell dominated, comprising 21.9% of the total (Fig. 1A). The average number of reads per sample in pool 2 was 11,224,892. The mean base quality (Q) score for each sample (excluding sample 9, the water control) ranged from 34.57 to 36.87 (mean Q of 35.6), and 88% of the bases had quality scores greater than 30 (Fig. 1B). The water control generated 3,692 reads with Q scores ranging from 2 to 40 (mean Q of 25), with only 47% of bases having a Q score greater than 30.
Mapping results for internal control HBG. The fraction of the globin reference sequence mapped in all samples with genomic DNA (all except the water control, sample 9) ranged from 93 to 100% (mean, 98%). Samples with 100 ng genomic DNA (samples 1 to 8, 10 to 11, 13, 15, and 17 to 32) generated a mean of 7,406 reads (range, 1,903 to 11,225) that, using L1-S1 stringency, mapped to the globin reference (human beta-globin [HBG] nucleotides [nt] 2041 to 3480) with an average depth of coverage ranging from 132.2 to 779.5 (mean coverage, 514.3) (Fig. 2). The number of mapped reads (range, 902 to 3,327; mean, 2,000.3) and coverage (range, 62.6 to 231; mean, 138.9) were reduced in samples with 10 ng genomic DNA (samples 12, 14, and 16). Only 2 reads from the water control (sample 9) mapped to beta-globin (depth of coverage of 0.1 and fraction of reference sequence mapped of 12%).
Evaluating cutoff to improve signal/noise ratio for HPV genotyping from whole-genome sequence data. We evaluated mapping results for the 8 control samples ( Table 1) in terms of the number of reads, average coverage, and the fraction of genome covered using different stringencies to differentiate signal from noise for determination of HPV genotype. Caski (100 ng) generated a total of 50,890,604 reads mapped to any HPV reference sequence, of which 99.94% of reads (50,861,070 reads) mapped specifically to HPV16 under the less stringent L0.5-S0.8 mapping condition. Whole-genome sequencing (WGS) also detected HPV16 in SiHa (134,664 out of 164,138 total reads; 82.04%) cells and HPV18 in HeLa (314,969 out of 342,887 total reads; 91.86%) cells as the most dominant types. Without any cutoffs for nonspecific signal, all positive controls detected additional HPV types under all three mapping stringencies. The presence of nonspecific signal was also seen in water and placental DNA negative controls. For example, WGS detected HPV16 with 1,800 reads in the negative placental DNA using L0.5-S0.8. With increased mapping stringency (L1-S1), placental DNA still detected HPV16 with 861 reads and average coverage of 10.9. Based on this, we selected a cutoff of Ն1,000 mapped reads and average coverage of Ն20 for reliable sequence assignment. We also added the fraction of reference genome covered of Ն0.5 (to indicate that at least 50% of the viral genome is retained, allowing for loss due to potential integration events) to the cutoff parameters. With L1-S1 mapping stringency and the selected cutoffs (number of mapped reads of Ն1,000, average coverage of Ն20, and fraction of genome covered of Ն0.5), HPV genotyping results for the 8 controls were concordant with the expected results ( Table 1). The control cell line samples, Caski (ϳ500 HPV16 copies/cell), HeLa (ϳ50 HPV 18 copies/cell), and SiHa (1 to 2 HPV16 copies/cell), vary in known copy numbers of HPV and were analyzed at two concentrations of input DNA (100 ng and 10 ng). In each case, the number of mapped reads under stringent conditions roughly correlated with copy number (Table 1), and type assignment could be made with input of 10 ng DNA. The fraction of genome covered for HPV18 in the HeLa cell line was only 63%. No reads mapped to a 2.6-kbp central region (nt 3100 to 5730), compatible with a deletion (Fig. 3).
Determination of HPV types from WGS data. Enriched whole-genome sequencing (eWGS) data from the remaining 24 samples were analyzed using criteria set in the control samples to assign WGS HPV results without knowledge of prior HPV data. Comparing eWGS typing results with prior Linear Array (LA) results, type-specific HPV detection was at least partially concordant in 18 of 24 samples (75%) ( Table 2). Genotype determinations for the 9 HPV plasmids (HPV45, -58, -31, -33, -52, -6, -18, -11, and -16) were 100% concordant. The average depth of coverage for the 9 WHO plasmids ranged from 1,210 to 3,751 with a mean of 2,837. Among the 15 cervicova- a Three mapping stringencies (L0.5-S0.8, L0.5-S1, and L1-S1) were evaluated using combination of parameters L and S, which represent read length (0.5 or 1) and similarity score (0.8 or 1), respectively. b HPV types were determined following cutoff on number of mapped reads of Ն1,000, average coverage of Ն20, and fraction of genome covered of Ն0.5 under L1-S1 mapping stringency.
ginal swab samples, typing results for 9 samples were fully or partially concordant. Of these 9, full concordance was found in 5 samples, 2 samples negative for HPV and 3 samples with multiple types (3, 4, and 8 types). Among the 6 discordant samples, 4 were negative by eWGS but positive for multiple types by LA, whereas 2 samples were HPV positive by both methods but for different types. The overall type-specific concordance for types included in the LA assay was 95.83%, and for LA-targeted types, there were no instances of samples positive by eWGS and negative by LA. In six samples, eWGS identified HPV types not included in the LA assay (HPV30, -43, -68a, -87, -90, -91, and -114). In cervicovaginal samples with multiple HPV types, the average depth of coverage varied greatly from as low as 27 to 129,998 (Table 2), probably a result of varying viral loads. While some LA types were not detected by WGS in this small sample set (HPV26, -35, -53, -72, -81, and -84), other LA types were detected in some but not all samples (HPV18 and -66). In some instances, eWGS detected types but did not meet the criteria for number of reads or depth of coverage. For example, in sample 25, eWGS failed to call HPV66 and HPV18, which were detected by LA, but 1,478 reads mapped to HPV66 with 99% of the genome covered but failed because the average depth of coverage (18.9) was just below the cutoff of 20. HPV18 had a subthreshold of 234 reads mapped with 71% genome coverage. In sample 20, eWGS failed to call HPV72; although 1,412 reads mapped to HPV72 with 87% genome coverage, the average depth of coverage of 17.7 was below the cutoff of 20.  In sample 29, LA detected HPV84 but eWGS detected HPV114. HPV84 and HPV114 have 84% identity over the whole genome, suggesting the possibility of misclassification based on genomic fragments. To explore this, sample 29 reads were mapped with HPV84 as the only reference sequence. While no reads mapped to HPV84 under stringent conditions, under L0.5-S0.8 conditions 62,203 reads mapped to HPV84 with 94% of the genome covered. HPV53 was not identified in any of the 4 samples that were LA positive for this type; however, HPV53 sequences were found in one of four samples (number of reads, 1,438; average coverage, 18; genome coverage, 99%) if mapping was done under less stringent conditions (L0.5-S0.8).
LA includes probes for HPV64, which has been reclassified as a subtype of HPV34. For sample 17, 16,120 reads mapped to HPV34 with 16% coverage (compared to the whole genome), whereas 1,736 reads mapped to a 458-bp HPV64 L1 partial sequence (GenBank accession number AJ812226.1) with 85% coverage. As the full-length reference sequence for HPV64 is not available and HPV64 has been reclassified as a subtype of HPV34, the eWGS results were assigned to HPV64 based on post hoc assessment, giving partial concordance with LA results.
Evaluation of custom HPV bait. The performance of the HPV custom RNA bait pool for the 9 HPV types included as individual plasmids is shown in Table 3. One HPV6 bait sequence had to be removed due to homology with the human genome, resulting in

DISCUSSION
This is the first application of NGS for whole-genome identification of essentially all known HPV types (191 HPV types in the alpha, beta, and gamma genera) using RNA baits of the Agilent SureSelect target enrichment technology. The target enrichment method avoids the limitations of targeting only limited areas of the genome, minimizes the potential for PCR bias, and significantly increases the potential to increase the number of types identified in a single assay. A recent report using Roche NimbleGen DNA bait-based target enrichment largely focused on understanding the mechanistic signatures of integration of HPV types (87 types, 63% alpha) in cervical carcinomas (17), with limited data on NGS-based HPV type determination and concordance with a current HPV typing assay. In comparison, we mainly focused on the development of a universal HPV typing assay to detect all known HPV types in epidemiologic studies that increasingly examines a broad spectrum of HPV in alpha, beta, and gamma genera in both mucosal and cutaneous specimens for their disease association (5,6,11,22). Toward this goal, we evaluated our method in terms of the performance of RNA baits for whole-genome identification, level of enrichment, and a number of read-mapping metrics for determination of HPV types under single and multiple infection.
The custom RNA bait library exhibited excellent performance in terms of the fraction of genome coverage, percentage of on-target reads mapped to predicted HPV types, uniformity of coverage for the types that we evaluated, and an observed average level of target enrichment (184,483) by a factor of nearly 5 log. Our results support the concept of specific capture and whole-genome sequencing of viruses from clinical samples through target enrichment technologies (18,23,24). Sequencing the target enriched libraries generated high-quality reads with more than 88% of bases having Q scores greater than 30. Mapping results revealed that this method could detect single and multiple infections of HPV along with HBG as an internal control for cellularity.
The mapping stringencies and threshold for the number of mapped reads required to avoid bleed-through of nonspecific HPV signals was established using 8 control samples (3 cell line controls at 100 ng, 3 cell line controls at 10 ng, human placental DNA at 100 ng, and water at 0 ng). The most stringent L1-S1 mapping conditions, along with a cutoff of Ն1,000 mapped reads, average coverage of Ն20, and fraction of genome covered of greater than 50%, allowed reliable sequence assignment. The control cell lines, Caski (ϳ500 HPV16 copies/cell), HeLa (ϳ50 HPV 18 copies/cell), and SiHa (1 to 2 HPV16 copies/cell), vary in copy numbers of HPV and were analyzed at two concentrations of input DNA (100 ng and 10 ng in 50 l). In each case, the number of mapped reads under stringent conditions roughly correlated with copy number, and type assignment could be made with an input of 10 ng DNA. In agreement with previous reports using other methods (25,26), the fraction of genome covered for HPV18 in HeLa cells was only 63%, since no reads mapped to a 2.6-kbp central region (nt 3100 to 5730), compatible with a deletion due to integration.
The bleed-through of reads from dominant sequences in adjacent clusters has been reported in multiplexed samples sequenced using the Illumina platform (27,28). This might happen more frequently under relatively high seeding density for cluster generation and when one sequence dominates. These two conditions seemed to have taken place in this study. At the seeding density of 5.3 pM used in this study, pool 1 and pool 2 libraries generated cluster densities of 1,173,000 and 1,183,000, respectively, which is slightly higher than the recommended cluster density of 850,000 to 1,000,000. Additionally, over 31 million reads of HPV16 from Caski (100 ng) dominated sequences and corresponded to the HPV16 bleed-through found in other samples in the same pool. The HPV16 sequences in the placental DNA negative control in pool 1 had the unique 29 nucleotide substitutions identical to Caski HPV16 sequences (data not shown). It is unlikely that misidentification of barcode sequences contributed to bleed-through, since we restricted the analysis to reads with 0 mismatches in the index reads.
We applied the mapping stringency and cutoff determined with the control samples for determination of HPV types in 24 samples blinded to their HPV status. The HPV types determined by the eWGS were 100% concordant with LA in the 9 plasmid samples, but concordance was lower in the cervicovaginal samples ( Table 2). Nine of 15 samples (60%) showed complete or partial concordance, and type-specific concordance for the 37 LA types was 95.83%. It could be anticipated that biologic samples from exfoliated cells would present greater challenges to HPV detection than purified plasmid DNA. These samples vary significantly in the amount of cellular and viral material. Given the relationship between viral copy number and number of reads noted in the plasmid and cell line results, it could be suspected that low copy numbers contribute to failure to detect types found by LA. Additionally, LA used 10 l of extract regardless of the DNA concentration, while the eWGS method used 100 ng. Of the 4 samples that were LA positive but eWGS negative, three had 6 to 10 times (600 to 1,030 ng) more DNA input for LA. On the other hand, even with only 15 epidemiologic samples, eWGS detected 7 HPV types (HPV30, -43, -68a, -87, -90, -91, and -114) not targeted by LA. The ability of this eWGS method to detect HPV types not targeted by current assays indicates it could be useful in evaluation of HPV-negative cervical lesions (29) and in epidemiologic studies of HPV in geographic regions that may have uncommon types in circulation (30). Some HPV types detected by LA were not detected by eWGS in this small sample set (HPV26, -35, -53, -72, -81, and -84). HPV84 was also reported as discordant between NGS and LA in earlier studies (10,15). Further evaluation is needed to determine explanations for the failure to detect these types. Reduced mapping stringency did detect HPV53 and HPV84 in some instances and there could be unreported variants affecting mapping, and/or these types may be present at low copy numbers. We were able to confirm the specificity of RNA baits for HPV53 at the highest stringency of reference mapping by detecting HPV53 plasmid DNA (data not shown).
In conclusion, we developed and provided results from the initial evaluation of a non-PCR whole-genome sequence-based approach which is possibly close to a gold standard for broad-spectrum HPV genotype determination. The method relies on a custom-made RNA bait library that showed excellent performance in terms of genome coverage, percentage of reads mapped to predicted HPV type, uniformity of coverage, and the level of target enrichment. Further optimization and evaluation of this eWGS method for HPV detection is required in terms of cost per sample and throughput. Additional studies with samples including more types and optimization of sequencing conditions to minimize the bleed-through are needed. Studies evaluating the reproducibility and lower limits of detection are planned. Finally, it should be noted that despite enrichment with RNA bait by a factor of 5 log, 75% of the total reads (average) were human (off-target), with the exception of the sample with the highest viral load (100 ng Caski), in which 36% of total reads were off-target (data not shown). This suggests adjustment in the conditions for on-target enrichment could enhance results. Future laboratory and bioinformatics efforts are focused on determining the sensitivity and reproducibility of this HPV genotyping method, along with expanding its application for HPV detection and genotyping in samples from a variety of anatomical sites and development of a bioinformatics pipeline for automatic determination of HPV types and variants. This methodology is conceptually suitable for detecting both known and unknown HPV types through adjusting stringencies at the level of hybridization to RNA bait and reference mapping parameters. The current reference mapping stringency used to detect known HPV genotypes would likely miss unknown types.

MATERIALS AND METHODS
Samples. DNA extracted from three cells lines (ATCC, Manassas, VA) known to include HPV16 (Caski, ϳ500 copies/cell; SiHa, 1 to 2 copies/cell) and HPV18 (HeLa; ϳ50 copies/cell) served as HPV-positive controls, and water and human placental DNA (Sigma-Aldrich Corporation, St. Louis, MO) were included as HPV-negative controls. A panel of coded samples was prepared so that testing and analysis could be performed in a blinded fashion. The blinded set included 9 residual samples from past WHO validation panels (whole HPV genome DNA plasmids diluted to 50,000 copies/sample in human placental DNA [100 ng/50 l]) and 15 residual DNA extracts from anonymized cervical/vaginal samples previously typed using the Linear Array (LA; Roche Molecular Diagnostics, Indianapolis, IN). Two of the cervical/vaginal extracts were negative for the 37 genotypes targeted by LA, and the other 13 samples were selected to include multiple HPV genotypes (median, 5; range, 2 to 12). DNA was quantified by fluorescence-based PicoGreen assay (Molecular Probes, Inc., Eugene, OR), and volume was adjusted with a buffer (pH 8.0) composed of 10 mM Tris-HCl and 1 mM EDTA so that a final concentration of 100 ng/50 l was used for library preparation. The exception was the water blank (0 ng DNA) and cell line samples that were also prepared at 10 ng DNA to test the effect of reduced DNA input on sequencing results. The final number of samples sequenced was 32: 8 control samples (3 cell line controls at 100 ng, 3 cell line controls at 10 ng, human placental DNA at 100 ng, and water at 0 ng) and 24 blinded samples (9 plasmid samples and 15 cervical/vaginal extracts [100 ng]).
HPV whole-genome reference sequences. We retrieved the whole-genome HPV reference sequences for all 188 HPV genotypes that were available in the PapillomaVirus Episteme (PaVE) database (http://pave.niaid.nih.gov/) as of September 2014 (31). We also retrieved whole-genome sequences for three HPV subtypes detected by LA (HPV55 [an HPV44 subtype], HPV82A/IS39, and HPV68b) from the National Center for Biotechnology (NCBI) database (http://www.ncbi.nlm.nih.gov/). The 191 HPV reference genomes included in this study (see Table S1 in the supplemental material) ranged in size from 7,095 to 8,104 bp and were distributed phylogenetically into 5 genera (67 alpha, 47 beta, 74 gamma, 2 mu, and 1 nu). The complete reference sequences were used for RNA bait design as well as for sequence mapping.
Design and synthesis of RNA baits. Agilent Technologies Inc. (Santa Clara, CA) designed a custom RNA bait library, using eArray software, comprised of 120-base RNAs complementary to one strand of the whole genomes of all 191 HPV genotypes at 2-fold coverage (60 bases overlapping). The resulting 23,941 RNA sequences comprised a bait library size of 1.442 Mbp. The bait library sequences were subjected to BLAST (Basic Local Alignment Search Tool) search against the NCBI human genome database, and 21 fragments were identified to have homology to human sequences. Removing these 21 fragments from RNA bait synthesis resulted in 100% coverage for 182 HPV genotypes and 99.17% to 99.25% coverage for 9 HPV genotypes (HPV6, -19, -34, -71, -82A/IS39, -142, -171, -172, and -175), for an overall design coverage of 99.96%. A BLAST search of the final synthesized bait sequences against the full NCBI database identified 15,669 unique hits, with 99.94% being specific to HPV sequences. The other 10 hits (0.064%) shared identity to chimpanzee, feline, and macaque papillomaviruses. As an internal control for the quality of samples, we included 12 RNA fragments covering nt 2041 to 3480 of the coding sequence of the human beta-globin (HBG) gene (1,439 bp; GenBank accession number GU324922.1). The custom biotinylated RNA bait library (HPV and HBG) was synthesized at Agilent Technologies. Upon request to the authors, the custom-designed ID may be shared with those who want to synthesize these probes through Agilent Technologies.
Target enrichment library preparation and sequencing. The workflow for the preparation of the sequencing library followed Agilent Technologies' SureSelect XT2 target enrichment protocol (version D3), optimized for 100 ng DNA (Fig. 4). Briefly, DNA samples (50 l in a 96-well microTUBE plate) were sheared using a Covaris LE220 focused ultrasonicator with SonoLab 7.3.2.4 software (Covaris, Inc., Woburn, MA). The conditions used for shearing (duty factor, 15%; peak incident power, 450; treatment time, 490 s) generated DNA fragments with a peak around 150 bp as determined by the high-sensitivity DNA kit and a Bioanalyzer 2100 (Agilent Technologies). DNA fragments were end repaired and 3= A-tailed, and they underwent precapture indexing whereby each sample was barcoded with an index of 8 bp sequence. Indexed libraries were amplified by PCR with a limited number of cycles (8 cycles), followed by purification and assessment of the quality and quantity of each library by a Bioanalyzer 2100 (Agilent Technologies). Indexed libraries with unique barcodes were pooled (16 samples/pool) for overnight hybridization with custom RNA bait, followed by capture of hybridized fragments and 14 cycles of PCR to amplify indexed libraries. Following further purification, the quality and quantity of postcapture HPV enriched pooled libraries were assessed by a Bioanalyzer 2100 and quantitative PCR (qPCR) using a KAPA DNA library quantification kit (KAPA Biosystems, Wilmington, MA) and a LightCycler 480 (Roche Diagnostics, Indianapolis, IN). Pooled libraries were paired-end sequenced on a two-lane flow cell on Illumina HiSeq 2500 at the Centers for Disease Control and Prevention (CDC) Core Facility using TruSeq Rapid SBS kit HS (200 cycle) (Illumina, San Diego, CA) with Rapid run mode according to the following settings: read 1, 100 cycles; index (i7), 9 cycles; read 2, 100 cycles. For each DNA library, a seeding concentration of 5.33 pM was applied for cluster generation, with 5% PhiX virus genome library added as a sequencing control.
Bioinformatics. The raw sequence data were demultiplexed, and the adaptors and barcodes were removed using Illumina BCl2fastq V1.8.4. Reads with a base Q score were exported as fastq files for batch mapping to reference sequences using CLC Genomics Workbench 7.5 (CLC bio, Waltham, MA). The reference sequences of 191 HPV types/subtypes and HBG that were used for bait design were imported into CLC Genomics Workbench for mapping.
The analysis was directed to HPV detection and typing, so duplicate reads were not removed. We used reads with 0 mismatches in the index sequence for analysis in this report. Two parameters, read lengths (L0.5 or L1) and similarity scores (S0.8 or S1.0), were used to adjust mapping stringency. BAM (binary alignment/map) files were analyzed to generate mapping statistics by CLC Genomics Workbench. Three additional parameters were evaluated in the process of differentiating signal from noise: number of reads mapped to the HPV type-specific reference sequence, average depth of coverage for the mapped sequences, and fraction of HPV genome covered by mapped reads. HPV types in 8 control samples detected by eWGS with different mapping stringency and acceptance parameters were recorded prior to unblinding expected HPV results from 24 blinded samples. Subsequent matching of HPV genotype calls by WGS to expected results was used to evaluate performance of the assay and analysis.
RNA bait performance was evaluated based on results from single HPV plasmids: percentage of HPV reference genome covered by sequenced reads, percentage of reads that mapped to predicted types (compared to total HPV reads), and uniformity of coverage. To evaluate the uniformity, the reference sequences were divided into bins consisting of 300 bp, and the average mapping depth within each bin was calculated by Strand NGS software (http://www.strand-ngs.com/) using BAM files generated by CLC Genomics Workbench under L1-S1 mapping stringency. The mean and the standard deviation (SD) of the average mapping depths among the bins were calculated. The uniformity of coverage was calculated as the percentage of bins with coverage within the average read depth Ϯ 2 SD for all bins. The target enrichment factor for a sample was calculated using the following formula: (total HPV mapped reads/ total HPV and human mapped reads)/(HPV genome size/human diploid genome size) (32). HPV and human diploid genome sizes were considered to be 8 kbp and 6.6 Gbp, respectively, for this calculation.