Proficiency Testing of Virus Diagnostics Based on Bioinformatics Analysis of Simulated In Silico High-Throughput Sequencing Data Sets

Quality management and independent assessment of high-throughput sequencing-based virus diagnostics have not yet been established as a mandatory approach for ensuring comparable results. The sensitivity and specificity of viral high-throughput sequence data analysis are highly affected by bioinformatics processing using publicly available and custom tools and databases and thus differ widely between individuals and institutions.

tant part of validating high-throughput sequencing-based virus diagnostics and could improve the harmonization, comparability, and reproducibility of results. There is a need for the establishment of international proficiency testing, like that established for conventional laboratory tests such as PCR, for bioinformatics pipelines and the interpretation of such results. KEYWORDS high-throughput sequencing, external quality assessment, nextgeneration sequencing, proficiency testing, virus diagnostics H igh-throughput sequencing (HTS) has become increasingly important for virus diagnostics in human and veterinary clinical settings and for disease outbreak investigations (1)(2)(3). Since the introduction of the first HTS platform only about 1 decade ago, sequencing quality and output have been increasing exponentially, and costs per base have decreased. Thus, HTS has become a standard method for molecular diagnostics in many virological laboratories. The relatively unbiased approach of HTS not only enables the screening of clinical samples for common and expected viruses but also allows an open view without preconceptions about which virus might be present. This approach has led to the discovery of novel viruses in clinical samples, such as Bas-Congo virus, associated with hemorrhagic fever outbreaks in Central Africa (2); Lujo arenavirus in southern Africa (3); and a bornavirus-like virus, the causative agent of several cases of encephalitis with fatal outcomes in Germany (4). Considering the potential of HTS to complement or even replace existing "gold-standard" diagnostic approaches such as PCR and quantitative PCR (qPCR), quality assessment (QA) and accreditation processes need to be established to ensure the quality, harmonization, comparability, and reproducibility of diagnostic results. While the computational analysis of the immense amount of data produced requires dedicated computational infrastructure, as well as bioinformatics knowledge or software developed by (bio)informaticians, the interpretation of the results also requires evaluation by an experienced virologist or physician. In many cases, true-positive results may be difficult to discern among large numbers of false-positive results or may be entirely missing from result sets due to false-negative results. Interpretation of results also requires knowledge of anomalies that may arise through sequencing artifacts or contamination.
Proficiency testing (PT) is an external quality assessment (EQA) tool for evaluating and verifying sequencing quality and reliability in HTS analyses. The pioneer in EQA and PT for infectious disease applications of HTS has been the Global Microbial Identifier (GMI) initiative, which has been organizing annual PTs since 2015, focusing on sequencing quality parameters, including the detection of antimicrobial resistance genes, multilocus sequence typing, and phylogenetic analysis of defined bacterial strains (https://www .globalmicrobialidentifier.org/workgroups/about-the-gmi-proficiency-tests) (5). Subsequently, the concept was similarly established regionally for U.S. FDA field laboratories (6,7). COMPARE (Collaborative Management Platform for Detection and Analyses of (Re-) emerging and Foodborne Outbreaks in Europe (http://www.compare-europe.eu/) is a European Union-funded program with the vision of improving the identification of (novel) emerging diseases through HTS technologies. Participating institutions have hands-on experience in viral outbreak investigation. One of the ambitious goals is to establish and enhance quality management and quality assurance in HTS, including external assessment and interlaboratory comparison.
In this study, we present the results of the first global PT offered by the COMPARE network to assess bioinformatics analyses of simulated in silico clinical HTS virus data. The viral sequence data set was accompanied by a fictitious case report providing a realistic scenario to support the identification of the simulated virus included in the data set.
Tools and programs for bioinformatics analysis. In recent years, numerous tools, programs, and ready-to-use workflows have been established, making metagenomics sequence analyses accessible to scientists from all research fields. Workflows for the typical analysis of HTS data and for the identification of viral sequences are based on the same general tasks and tools, including quality trimming, background/host subtraction, de novo assembly, and sequence alignment and annotation. Sequence processing usually starts with obligatory quality assessment and trimming, using programs such as FastQC or Trimmomatic, including the removal of technical and low-complexity sequences or the filtering of poor-quality reads (8,9). Following these initial steps, many workflows include the subtraction of background reads, e.g., host and bacteria, to reduce the total amount of data and increase specificity, using tools such as BWA (Burrows-Wheeler Alignment Tool) or Bowtie 2 (10,11). De novo assembly of HTS reads into longer, contiguous sequences (contigs), followed by reference-based identification, has been shown to improve the sensitivity of pathogen identification. Such analyses depend heavily on the use of assemblers, such as SPAdes or VELVET, which make use of specific assembly algorithms, such as overlap-layout-consensus graph or de Bruijn graph algorithms (12,13). Alignment tools such as BLAST, DIAMOND (double-index alignment of next-generation sequencing [NGS] data), Kraken, and USEARCH are among the most important components in bioinformatics workflows for pathogen identification and taxonomic assignment of viral sequences (14)(15)(16)(17). Since commandline tools for HTS require specific knowledge in bioinformatics, complete workflows and pipeline approaches have been developed, including ready-to-use Web-based tools, such as RIEMS (reliable information extraction from metagenomic sequence data sets), PAIPline (PAIPline for the automatic identification of pathogens), Genome Detective, and others (18)(19)(20). Since the COMPARE in silico PT focuses on comparing different tools and software programs for bioinformatics analyses, an overview of frequently used programs is given in Table 1. A more extensive overview of virus metagenomics classification tools and pipelines published between 2010 and 2017 can be found at https://compare.cbs.dtu.dk/inventory#pipeline.

MATERIALS AND METHODS
Organization. The virus PT was initiated by the COMPARE network and organized by the Robert Koch Institute. Participation was free of charge for research groups experienced in analyzing HTS data sets, and the opportunity was announced through email and the COMPARE website.
Participants were asked to analyze an in silico HTS data set; the main goal was to identify the viral reads with their bioinformatics tools and workflows of choice and to interpret the results obtained, including final diagnostic conclusions.
An artificial, simulated in silico data set of Ͼ6 million single-end 150-bp Illumina HiSeq sequences derived from viral genomes, human chromosomes, and bacterial DNA was provided to 13 different European institutes for bioinformatics analysis toward the identification of viral pathogens in highthroughput sequence data. In order to assess how different levels of experience and/or bioinformatics methodologies affect the outputs and interpretation, participants were allowed to use their bioinformatics tools and workflows of choice. Participants were invited to report the PT results via an online survey within 8 weeks (from 16 September 2016 until 16 November 2016). Overall results were anonymized by the organizers, but each participant was provided with the identifier for its own results.
In silico HTS data set. The simulated in silico data set consisted of a total of 6,339,908 reads ( Table  2), based on a single-end 150-bp Illumina HiSeq 2500 system run with an empirical read quality score distribution of Illumina-specific base substitutions. The artificial data set was simulated with the ART program (21). Sequences were generated from the Human Genome Reference Consortium Build 38 (GRCh38; NCBI accession numbers CM000663 to CM000686), Acinetobacter johnsonii (NCBI accession number NZ_CP010350.1), Propionibacterium acnes (NCBI accession number NZ_CP012647.1), and Staph-  (Table 2). TTV and HSV-1 were included in the panel as the easiest sequences to identify (with 1,917 and 2,000 reads, respectively, and 100% nucleotide identity with the reference sequences), followed by a slightly altered MeV (1,000 reads, with 82% nucleotide identity to the reference genome) and, as the likely most difficult taxon, nABV (only 500 reads and 55% nucleotide identity to reference sequence JN014950.1). Participants. Thirteen participants applied for the COMPARE virus PT and completed the survey within the given time frame. Participants were registered from Belgium (n ϭ 1), Denmark (n ϭ 1), France (n ϭ 1), Germany (n ϭ 4), Greece (n ϭ 1), Italy (n ϭ 1), The Netherlands (n ϭ 2), Portugal (n ϭ 1), and the United Kingdom (n ϭ 1). The 13 participants represented 13 different institutes or organizations. Information about the participants' backgrounds is given below (see Table 4).
Case report. To simulate clinical relevance and to set the background for evaluation of the bioinformatics results, the following fictitious case report was provided with the data set: Recently, a 14-year-old boy from Berlin, Germany, was hospitalized with sudden blindness, reduced consciousness and movement disorders. The patient's mother reported developmental disorders starting 1 year ago, with concentration problems, uncontrolled fits of rage, overall decreasing performance in school and occasional compulsive head nods. Unfortunately, the patient had received neither medical examination nor treatment, but had attended psychological treatment, assuming behavioral problems.
Magnetic resonance tomography of the patient's brain showed white and gray matter lesions and gliosis. Soon after hospitalization, the patient showed a persistent vegetative state and died.
A sample of the boy's brain tissue was sequenced using the Illumina HiSeq 2500 platform, resulting in approximately 6 million single end reads of 150 bp each.
This case of subacute sclerosing panencephalitis (SSPE) can be caused by a persistent infection with a mutated MeV (22). However, the symptoms described could also be caused by HSV-1 or bornavirus-like viruses (4,23).
Reported PT results. Results were collected using the Robert Koch Institute's online survey software VOXCO. The survey contained 23 questions, including general participant information and specifications about the programs used, parameter settings, and computer specifications, as well as the final results of the PT, including an evaluation of the case (see Table S1 in the supplemental material). The responses were collected as single or multiple options from a multiple-choice questionnaire with additional free text for remarks and comments.
Analysis of PT results. The results were evaluated based on sensitivity (true-positive rate, i.e., the fraction of true virus reads that were identified), specificity, and the total time of the bioinformatics analysis ( Table 3). The time of analysis was evaluated based on the computational time only, without including the time for preparation and discussion of the bioinformatics results. Correlation of the time of analysis with computer and server specifications was based only on the use of online analysis, a personal computer, a server, and a high-performance virtual machine. Although pathogen identification by HTS-related metagenomics should naturally involve experienced qualified health professionals, participants were challenged to attempt an interpretation regardless of the background of the team performing bioinformatics. Given this context, no qualitative or quantitative scoring was performed in this part. Availability of data. The data set used in this study has been uploaded to the European Nucleotide Archive with the study accession number PRJEB32470.

RESULTS
PT results. The results of the PT were evaluated based on sensitivity, specificity, total turnaround time, and interpretation of results (Table 3). HSV-1 was identified by all participants (Tables 3 and 4; Fig. 1). For most of the participants, the identified read numbers for HSV-1 were complete or nearly complete (actual HSV-1 read count, 2,000). One participant identified more reads of HSV-1 than were present in the data set (participant 7; 8,361 reads identified).
TTV (actual read count, 1,917) and MeV were identified by all participants except for one (participant 4) (Tables 3 and 4; Fig. 1). For TTV, the read numbers identified were complete or almost complete for all participants, with the exception of participant 9, who was able to identify only 29% of the TTV reads. For the mutated MeV (actual read count, 1,000), 7 of the 13 participants were able to identify complete or almost complete read numbers (participants 3, 5, 6, 8, 10, 11, and 12), whereas 4 participants (participants 1, 2, 9, and 13) identified only 21%, 46%, 49%, and 34% of the total number of 1,000 reads, respectively (Table 3). Participant 4 was unable to identify MeV, and participant 7 assigned too many reads (1,411) as originating from the mutated MeV.
The divergent nABV (actual read count, 500) proved to be the most challenging target and was identified by only four of the participants (participants 3, 5, 6, and 12) (Tables 3 and 4; Fig. 1). The overall specificity for all bioinformatics workflows was high, with only participant 6 identifying 43 reads as a chordopoxvirus, a false-positive result.
The total times of analysis differed widely, from 3 h (participant 1) to 216 h (15 h of online analysis, with an additional 201 h waiting for server availability; participant 4) ( Table 5). Most workflows were calculated on a server system; two participants used a personal computer, and two used a virtual machine. One calculation was executed through an external public server.
Most of the workflows used in the COMPARE virus PT were quite similar, with the same basic tasks applied in different orders (Fig. 2). Most workflows started with trimming and quality filtering, followed by the subtraction of background reads, the assembly of remaining reads, and a final reference-based viral read assignment (Fig. 1). The databases used were custom-made or full databases from NCBI nt/nr GenBank (participants 1 to 4, 6 to 11, and 13). Participants 5 and 12 used viral sequences from NCBI GenBank only, while participant 7 also included a database for human-pathogenic viruses (ViPR) (https://www.viprbrc.org/brc/home.spg?decoratorϭvipr). All groups were also asked to correlate the results based on the bioinformatics analysis with the clinical symptoms described in the case report (Table 4). HSV-1 was suspected as the disease-causing agent by three groups, and MeV was identified by six groups. An MeV infection with HSV-1 possibly affecting the course of disease was named by two groups. nABV was interpreted as the single causative agent by two groups.

DISCUSSION
HTS-based virus diagnostics requires complex multistep processing, including laboratory preparation, assessment of the quality of sequences produced, computationally challenging analytic validation of sequence reads, and postanalytic interpretation of results. Therefore, not only comprehensive technical skills but also bioinformatic, biological, and medical knowledge is of paramount importance for proper analyses of HTS data for virus diagnostics.
HTS data can comprise several hundred thousand to many millions of reads from a single sequenced sample. Handling and analyzing such amounts of data pose computational challenges and currently require know-how and expertise in bioinformatics. Depending on the laboratory procedure, identification of viral reads from clinical metagenomics data is negatively affected by low virus-to-host sequence ratios and high viral mutation rates, making reference-based sequence assignments for highly divergent viruses challenging (24).
In silico bioinformatics analysis of HTS data can be separated into an analytic and a postanalytic step. The analytic step includes the processing of sequence reads with software tools or scripts assembled into workflows and pipelines. The postanalytic step  is evaluation of the results obtained from the bioinformatics analysis with regard to pathogen identification, often involving interpretation by an experienced, qualified health professional to correlate bioinformatics results with clinical and epidemiological patient information. The bioinformatics analysis and the technical identification of viral reads from the HTS data set were shown to have decreasing success as sequences became more divergent from reference strains, as exemplified by MeV, with 82% identity on the nucleotide level to its closest relative, and nABV, with just 52% identity on the nucleotide level to other bornaviruses, which was identified by only 4 of the 13 participants. MeV and TTV were missed by participant 4, whose analysis was based on the Kraken tool and an in-house workflow. Kraken is known to align sequence reads to reference sequences with high specificity and low sensitivity, making the alignment of mutated and divergent virus reads difficult (15). Since Kraken employs a user-specific reference database, TTV may have been absent from the custom database; Kraken was also used by participant 7, which was able to identify both MeV and TTV. It is noted that the use of different databases is an obstacle in bioinformatics analysis of HTS data. To date, there have been unified, curated virus reference databases only for influenza viruses (EpiFlu) (25), HIV (26) and human-pathogenic viruses (ViPR) (27). Recently, viral reference databases for bioinformatics analysis of HTS data have been developed (https://hive.biochemistry.gwu.edu/rvdb, https://rvdb-prot.pasteur.fr/) (28). NCBI offers the most extensive collection of viral genomes, but the lack of curation and verification of submitted sequences often leads to false-positive and false-negative results. To overcome such problems, reference-independent tools for virus detection in HTS data have been developed, making the discovery of novel viruses feasible without any knowledge of the reference genome (29). All of the participants that were able to identify the divergent nABV used workflows based on protein alignment approaches, including BLASTx/p, USEARCH, and DIAMOND, which are known to be highly sensitive (14,17). The identification of such highly divergent viruses is still challenging and cannot be accomplished by workflows with nucleotide-only reference-based alignment approaches. DIAMOND, which became available in 2015, was specifically designed for such sensitive analysis of HTS data at the protein level and is as much as 20,000 times faster than BLAST programs. Compared to other alignment tools, which seem to have a trade-off between speed and sensitivity, DIAMOND offers superior sensitivity for the detection of mutated and divergent viral sequences (14). However, the detection of such highly divergent viral sequences in patient samples is rare, and virus discovery is not a routine part of clinical virus diagnostics.
In terms of specificity, all workflows were highly specific; only workflow 6 showed the identification of a chordopoxvirus that was not present in the data set. Such false-positive results, as well as the excessive number of HSV-1 and MeV reads found by participant 7 (8,361 of 2,000 reads and 1,411 of 1,000 reads, respectively), can derive, for example, from low-complexity reads in the data set that are aligned to low-complexity or repetitive sequences of the viral reference genomes, from inappropriate matching score limits during filtering, or from inappropriate algorithm parameters. Furthermore, custom databases and viral references from NCBI can include sequences of human origin that can lead to false-positive results, resulting, in some cases, in nonreporting of other matches due to default algorithm reporting limits.
The total times of all workflows differed widely, from only 3 h to 216 h (15 h for the analysis and 201 h waiting for available servers). One of the fastest participants was participant 1, which needed only 3 h to perform the calculations on a scalable highperformance national virtual machine, whereas the slowest workflow (participant 4; 216 h) involved calculation on a personal computer through an external public server where bioinformatics software jobs are queued among many other users ( Fig. 1; Table  5). However, participant 5 also performed analysis on a notebook but within a much shorter time (26 h). Overall, workflows exclusively specified for virus detection or using only a viral or RefSeq database did not clearly correlate with shorter times than workflows with full metagenomics analyses. However, the specific composition of each database was not provided. To finally evaluate the performance of each bioinformatics workflow with regard to the time of analysis, all workflows should be run on the same computer system, but such standardization was not practical for this PT evaluation.
The COMPARE virus PT has further shown that both analytic work and postanalytic evaluation are of importance, since similar analytic results can be interpreted very differently, depending on the analyzing participant. Unlike standard routine virus diagnostic approaches such as PCR, where a medical hypothesis of relevance tests either positive or negative, HTS offers an extensive and largely unbiased catalogue of results. The etiological agent of a patient sample can be masked by false-positive results, sequencing contaminants, commensal viruses of the human virome, or viruses of yet unknown importance. Furthermore, the causative viral agent of a disease may be present in very low read numbers, because viral loads may be low, depending on the timing of sampling and the sample matrix. RNA viruses, among which are the most pathogenic human viruses, usually have smaller genomes than DNA viruses (30,31). Therefore, low read numbers from an RNA virus might be dismissed, resulting in a false-negative result. To assess sequencing results, some workflows and pipelines use cutoffs for read numbers so as to reduce false-positive results, but they may in the process make the detection of low-read-number matches less likely.
Since the analysis of HTS data for virus diagnostics requires bioinformatics as well as virological knowledge, collaboration between the two disciplines has been emphasized (32). Furthermore, automated pipelines for HTS-based virus diagnostics with unbiased evaluation of the pathogenicity and relevance of the pathogen detected have been implemented; these can help harmonize the analysis and interpretation of HTS sequence results (33).
A robust approach to viral diagnostics using HTS requires further refinement and validation. The COMPARE in silico PT is limited by the low complexity of the simulated data set. In vivo sequence data sets can comprise a highly diverse background and microbiome of the host, further increasing the difficulty of identifying viral reads. Further proficiency schemes with in vivo data sets and samples and wider collaboration are required to make progress. A second in silico PT organized by the COMPARE network has focused on the interpretation of the significance of foodborne pathogens in a simulated data set (unpublished data). Again, the interpretation of the results was shown to be one of the most diverse and critical points in HTS data analysis. Furthermore, third-generation sequencing technologies, such as MinION from Oxford Nanopore Technologies, are becoming available in many laboratories and field settings due to low cost and short sequencing times (34)(35)(36). However, analysis tools developed for second-generation sequencing technologies, such as the Illumina system, may not be applicable for third-generation sequencing data, due to the low sequencing accuracy of approximately 85% and the length of the sequences, which can be as long as 2 Mbp (37)(38)(39). Consequently, future PTs should also include the use of third-generation sequencing technologies, since those are likely to become part of routine laboratory diagnostics in the future.
Conclusion. The present availability of external quality assessment for HTS-based virus identification is limited. The COMPARE in silico virus PT has shown that numerous tools and different workflows are used for virus analysis of HTS data and that the results of such workflows differ in sensitivity and specificity. At present, there are no standard procedures for virome analyses, and the sharing, comparison, and reliable production of the results of such analyses are difficult.
Finally, there is a clear need for creating updated, highly curated, free, publicly available databases for harmonized identification of viruses in virome data sets, as well as mechanisms for conducting continuous ring trials to ensure the quality of virus diagnostics and characterization in clinical diagnostic and public and veterinary health laboratories.