Exploiting Bacterial Whole-Genome Sequencing Data for Evaluation of Diagnostic Assays: Campylobacter Species Identification as a Case Study

The application of whole-genome sequencing (WGS) to problems in clinical microbiology has had a major impact on the field. Clinical laboratories are now using WGS for pathogen identification, antimicrobial susceptibility testing, and epidemiological typing. WGS data also represent a valuable resource for the development and evaluation of molecular diagnostic assays, which continue to play an important role in clinical microbiology. To demonstrate this application of WGS, this study used publicly available genomic data to evaluate a duplex real-time PCR (RT-PCR) assay that targets mapA and ceuE for the detection of Campylobacter jejuni and Campylobacter coli, leading global causes of bacterial gastroenteritis. In silico analyses of mapA and ceuE primer and probe sequences from 1,713 genetically diverse C. jejuni and C. coli genomes, supported by RT-PCR testing, indicated that the assay was robust, with 1,707 (99.7%) isolates correctly identified. The high specificity of the mapA-ceuE assay was the result of interspecies diversity and intraspecies conservation of the target genes in C. jejuni and C. coli. Rare instances of a lack of specificity among C. coli isolates were due to introgression in mapA or sequence diversity in ceuE. The results of this study illustrate how WGS can be exploited to evaluate molecular diagnostic assays by using publicly available data, online databases, and open-source software.

A ccurate and timely diagnosis of infectious diseases is a cornerstone of clinical microbiology. Notwithstanding the ongoing importance of conventional culture in many settings, molecular diagnostics have markedly improved pathogen detection and identification (1). The most recent development in this area is the application of whole-genome sequencing (WGS) to problems in clinical microbiology (2)(3)(4)(5). Although WGS is transforming the field, genomics and rapid molecular tests have complementary roles to play in diagnostic microbiology, particularly in resourcelimited environments.
Since their introduction in the 1980s, nucleic acid amplification tests (NAATs), including multiplex assays that facilitate syndrome-driven diagnosis, have come to be widely used in bacteriology laboratories (1). In particular, multiplex NAATs are becoming increasingly popular for the identification of gastrointestinal pathogens, which include a wide range of viruses, bacteria, and parasites (6,7). Many NAATs have, however, been designed by using representative nucleotide sequences from a limited number of isolates. During the pre-WGS era, the performance of NAATs could not be examined at the population level because the requisite large isolate collections were challenging to assemble and primer sequences were difficult to determine by Sanger sequencing.
The recent increase in WGS has generated an abundance of publicly available genomic data that have the potential to improve the development and evaluation of NAATs and other molecular diagnostics (8). At the time of writing of this work, growing numbers of assembled bacterial genomes were becoming available in public repositories, such as the NCBI (https://www. ncbi.nlm.nih.gov /genomes/MICROBES/microbial_taxtree.html); however, the majority of WGS data were available only as unassembled short reads. This limited their use to laboratories with bioinformatics expertise and resources. The PubMLST databases (http://pubmlst.org) address this issue by making large numbers of de novo assembled bacterial genomes publicly available through a web interface with analysis tools (9,10). As of September 2016, the Ribosomal Multilocus Sequence Typing (rMLST) Database (http://pubmlst.org/rmlst/) (11) contained over 180,000 assembled bacterial genomes, which corresponded to more than 4,500 bacterial species. Similarly, an increasing number of species-specific PubMLST databases are also being populated with WGS data.
This study demonstrates how WGS data can be exploited to evaluate diagnostic assays. Campylobacter bacteria, a leading cause of bacterial gastroenteritis (12), were used as an exemplar. As Campylobacter bacteria are difficult to culture and identify, NAATs have become a popular tool for the diagnosis of campylobacteriosis (7,13,14); however, the extent to which existing assays are affected by the high levels of genetic diversity common among clinical isolates (15), or introgression, that is, the transfer of DNA between Campylobacter species (16)(17)(18), is unknown. The case study presented here is a duplex TaqMan real-time PCR (RT-PCR) for identification of Campylobacter jejuni and Campylobacter coli (19). Developed in the pre-WGS era using single gene sequences, the assay and variations thereof have been used for routine isolate identification to the species level (19)(20)(21); studies of Campylobacter isolates from humans (22)(23)(24)(25), animals (26)(27)(28)(29)(30)(31)(32)(33)(34)(35), and the environment (36); and outbreak investigations (37). For C. jejuni, the RT-PCR target is mapA, which encodes a putative outer membrane lipoprotein (38) shown to be immunogenic in chickens (39,40). For C. coli, the target is ceuE, which encodes a periplasmic binding protein involved in iron scavenging (41). As mapA and ceuE are present in both organisms, the species specificity of the assay is contingent on the conservation of primer-and probe-binding sequences. Accordingly, the mapA-ceuE RT-PCR provided an opportunity to explore the utility of genomics and population genetics approaches for in silico evaluations of diagnostic assays.

MATERIALS AND METHODS
WGS data. Best and colleagues (19) validated the mapA-ceuE assay by using clinical Campylobacter isolates from the United Kingdom. As Campylobacter genotypes circulating in Oxfordshire are representative of the United Kingdom (15,25,42,43) and other high-income countries (http: //pubmlst.org/campylobacter/), the Oxfordshire sentinel surveillance collection (15) was identified as an appropriate source of WGS data for this study. WGS data from 1,724 Campylobacter isolates were accessed via the Campylobacter jejuni/coli PubMLST database (http://pubmlst.org /campylobacter/). These Campylobacter bacteria comprised all of the single patient isolates recovered in Oxfordshire between June 2011 and June 2013.
Genome annotation and data extraction. The autotagger functionality within the PubMLST Bacterial Isolate Genome Sequence Database (BIGSdb) software (9) was used to identify mapA (PubMLST locus id CAMP0952), ceuE (CAMP1271), and the 7 multilocus sequence typing (MLST) (44,45) and 52 rMLST (11) loci. Sequences with Ն98% identity and Ն98% alignment with existing alleles were annotated automatically. Using curation tools available in PubMLST, predicted sequences with 70 to 98% identity to existing alleles were aligned at the nucleotide and amino acid sequence levels with the closest match in the database. Following visual inspection of the alignments, complete coding sequences were added to the database. Those with internal stop codons were "flagged," that is, highlighted in the database, and marked as "visually checked." Allelic data and corresponding nucleotide sequences, MLST-defined sequence types (STs), clonal complexes, and ribosomal STs (rSTs) were exported from the database by using the BIGSdb data export plugin (9).
Isolate diversity and species identification. The allelic diversity of the MLST and rMLST data was determined by using the bias-corrected version of Simpson's index of diversity (D) (46,47) with 95% confidence intervals (CIs) (48). Possible values of D ranged from 0 (no diversity) to 1 (maximum diversity). The distribution of MLST clonal complexes was compared to that observed for 3,349 human disease isolates recovered in Oxfordshire between 2003 and 2009 (42). Study isolates were assigned to species groups by using rMLST (11). For this analysis, concatenated nucleotide sequences of unique rSTs (ϳ20,780 bp) were aligned with MAFFT version 7.037b (49). The memory requirements for maximumlikelihood (ML) analysis of the study data set exceeded that of a standard installation of MEGA version 5.05; therefore, an ML phylogeny was generated on a Linux server with MEGA-CC version 7.0 (50) by using the general time-reversible model with gamma-distributed rates plus invariant sites with 500 bootstrap replicates (51). This analysis required knowledge of the command line and took 9 days. As usability and computational speed were considered important factors in this study, the ML phylogeny was compared to a neighbor-joining tree (52) reconstructed in MEGA version 5.05 (51) with the Kimura two-parameter model (53) by using 1,000 bootstrap replicates. At the population level, C. coli segregates into three clades (54), and additional rMLST analyses were carried out to resolve the assignment of a subset of isolates to these groups. The approach described above was used to compare rSTs of interest to a reference set of 15 C. coli genomes representative of the three clades, with the ML phylogeny generated with the Tamura-Nei model with gamma-distributed rates plus invariant sites with 500 bootstrap replicates (see Table S2 in the supplemental material) (16,55,56).
In silico assay evaluation. Nucleotide sequence alignments of unique mapA and ceuE alleles were generated as for the rMLST phylogeny, and regions corresponding to the forward primer, probe, and reverse primer (19) were extracted. Primer and probe nucleotide sequence fragments were aligned and concatenated, and unique combinations were assigned allele numbers in the order of discovery.
RT-PCR confirmation of in silico evaluation results. Archived genomic DNA and bacterial cultures were available for the study isolates (15), which facilitated RT-PCR confirmation of the in silico evaluation results. Representative isolates (n ϭ 124) were chosen for RT-PCR such that each unique mapA and ceuE forward primer, probe, and reverse primer combination was tested at least once, with the subset also representative of the genetic diversity of the study data set. For isolates with insufficient archived genomic DNA (n ϭ 5), glycerol stocks of singlecolony cultures were inoculated onto Columbia agar with horse blood (Oxoid Ltd., Basingstoke, United Kingdom) and incubated in a microaerobic atmosphere at 42°C for 48 h. Boiled cell lysates were prepared from single colonies as previously described (19). RT-PCR was carried out according to the method of Best et al. (19), and positive results were defined as those with cycle threshold (C T ) values ranging from 12 to 30. Genetic diversity, introgression, and selection in RT-PCR targets. Individual mapA and ceuE nucleotide sequence alignments and gene phylogenies were generated as described for rMLST. The mapA ML phylogeny was constructed with the Tamura three-parameter model with gamma distributed rates with 500 bootstrap replicates, and the same parameters were used for ceuE, with the addition of invariant sites. Nucleotide sequences were translated with MEGA version 5.05 (51), and allele numbers were assigned to unique protein sequences. STRUCTURE (57), a Bayesian clustering algorithm, was used to characterize introgression in mapA and ceuE as previously described (17,18). Isolates were probabilistically assigned to species with the linkage model, which adjusts for linkage disequilibrium between nucleotides (58). The model was run with default settings for 10,000 burn-in iterations and 10,000 additional iterations, assuming a population number (k) of 2. Putative mosaic alleles were identified as those with a Յ0.75 probability of belonging to either C. jejuni or C. coli (18). Site-by-site frequencies generated by STRUCTURE were used to identify nucleotide sequence fragments with different ancestries in putative mosaic alleles (18,58). After putative recombinant alleles were excluded, within-and between-group p distances were calculated for C. jejuni-and C. coli-specific gene and protein sequences with DnaSP version 5.10 (59). Species-specific synonymous and nonsynonymous substitution rates (dN/dS) were calculated for mapA and ceuE alleles encoding fulllength protein sequences with SNAP version 2. 1.1 (www.hiv.lanl.gov) (60).

Isolate diversity and species identification.
Complete nucleotide sequences of mapA and ceuE and the MLST and rMLST loci were obtained from 1,713/1,724 (99.4%) isolates (see Table S1 in the supplemental material), excluding those with: incomplete MLST and/or rMLST profiles (n ϭ 8), misassembled mapA or ceuE sequences (n ϭ 1), or multiple alleles at any of the rMLST loci (n ϭ 2), which is an indicator that a mixed culture may have been sequenced. The isolates included can be accessed via the Campylobacter jejuni/coli PubMLST isolate database and are grouped in the mapA-ceuE evaluation project. The collection comprised 293 STs (D ϭ 0.974 [95% CI, 0.972 to 0.976]) and 597 rSTs (D ϭ 0.989 [95% CI, 0.988 to 0.991]). The STs were assigned to 33 clonal complexes, with proportions similar to those observed previously in Oxfordshire (Fig. 1A) (15,42). Species designations were inferred from the ML and neighbor-joining rMLST phylogenies (11), which aggregated rSTs into identical species groups. As the same was true for all paired phylogenies, only the neighbor-joining trees are presented here. C. jejuni accounted for 1,521 (88.8%) isolates, and C. coli accounted for the remaining 192 (11.2%) (Fig.  1B). Two C. coli rSTs, rST398 (n ϭ 2) and rST4701 (n ϭ 1), were distinct from the other C. coli sequences and occurred at the tip of a long branch (Fig. 1B). Further rMLST analyses indicated that these isolates belonged to C. coli clade 3 (Fig. 1C).
In silico assay evaluation. There were 72 mapA alleles of 645 bp and 126 ceuE alleles of 990 to 994 bp represented in the isolate collection. Differences in ceuE allele lengths were due mainly to variation in three homopolymeric tracts, which resulted in internal stop codons in 11 C. jejuni-specific alleles (n ϭ 20) (Table 1). These alleles were flagged and marked as visually checked in the database. To evaluate assay specificity, primer-and probe-binding sequences were extracted from mapA and ceuE and analyzed in detail.
mapA. Twenty-three unique mapA primer-and-probe combinations were identified among the study isolates. Twelve were present only in isolates designated C. jejuni, nine were in C. coli, and two were in both species. Two distinct groups, consistent with microbiological species, were evident from the nucleotide sequence alignment of these unique combinations ( Fig. 2A). Primer and probe sequences were conserved among C. jejuni isolates. The predominant primer-and-probe combination was detected in 1,166 (76.7%) isolates and was identical to the published sequences (19). Sequence variation among divergent C. jejuni combinations was limited to between one and five polymorphisms across the three regions. C. coli sequences were also conserved but were divergent from C. jejuni combinations, differing from the published sequences (19) at up to 18 sites ( Fig. 2A); however, four C. coli isolates carried nonspecific primer-and-probe combinations (Table 2). Two combinations, each present in a single C. coli isolate, corresponded to predominant C. jejuni-specific alleles 1 and 2 ( Fig. 2A). The remaining two nonspecific combinations were composites of C. coli and C. jejuni sequences ( Table  2; Fig. 2A).

ceuE.
The 26 ceuE primer-and-probe combinations identified among the study isolates were all species specific. Twenty-one were present only in C. jejuni, and five were present only in C. coli. Primer-and-probe combinations were also stratified by species at the nucleotide sequence level (Fig. 2B). C. coli sequences were highly conserved, with 186 (96.9%) isolates identical to the published primer and probe sequences (19). Nucleotide variation was limited to between one and eight polymorphisms per primer-and-   TGAAGCAAAG ATTAAAGGGC TTTTATACAT TAGCGATGTT GGAATTCAAA TAAACGCACT TTAGACACTG GTATTG probe combination, the majority of which occurred in alleles 25 (n ϭ 2) and 26 (n ϭ 1), which were present in the clade 3 C. coli isolates ( Table 2; Fig. 2B). In contrast, C. jejuni combinations were divergent from the published sequences (19), containing between 10 and 13 nucleotide sequence differences across the three regions. C. jejuni isolates were also more evenly distributed across primer and probe sequences, with eight combinations accounting for 96.3% of the isolates, in contrast to a single combination accounting for 96.9% of the C. coli isolates (Fig. 2B).
Predicted assay performance and RT-PCR confirmation. Predicted species designations based on the results of the in silico evaluation were consistent with rMLST species assignments for 1,707/1,713 (99.7%) isolates, corresponding to 1,521 (100%) of the C. jejuni and 186 (96.9%) of the C. coli isolates. These results were confirmed by RT-PCR testing of 124 representative isolates. C. coli isolates with complete C. jejuni-specific mapA primer and probe sequences were mapA positive/ceuE positive ( Table 2). RT-PCR results for the C. coli isolate carrying C. jejuni-specific mapA forward primer and probe sequences and the three clade 3 C. coli isolates were inconclusive, as the C T values for mapA and ceuE, respectively, ranged from 32 to 37, exceeding the assay cutoff of 30 (19) (Table 2; see Table S3 in the supplemental material). Although this study was not designed to quantify the effects of primer and probe mismatches on target detection, there was a correlation between the number of polymorphisms and C T values (see Table S3).

Introgression, diversity, and selection in RT-PCR targets.
Additional analyses were carried out at the whole-gene level to explore the impact of introgression, diversity, and selection on assay specificity. Individual gene phylogenies confirmed that mapA and ceuE alleles were species specific (Fig. 3), with the exception of mapA alleles 20 and 88, which were present in the mapA-positive/ceuE-positive C. coli isolates ( Fig. 3A; Table 2). Clade 3 C. coli mapA and ceuE alleles clustered with the other C. coli sequences; however, they were distinct from clade 1 sequences and were at the end of a long branch in both phylogenies, indicative of genetic divergence (Fig. 3). Also noteworthy were five C. coli alleles that occupied intermediate positions on the mapA phylogeny (Fig. 3A). Taken together with the interspecies transfer of alleles 20 and 88, these findings indicated introgression in mapA, which supported the results of the in silico evaluation.
Drawing on population genetics approaches, introgression in the RT-PCR target genes was formally characterized with STRUCTURE, with mixed ancestry detected only in the mapA gene of 17 (8.9%) C. coli isolates. In addition to the two previously identified complete gene transfers, five putative mosaic alleles were detected (n ϭ 15) (Fig. 4A). All imported DNA was identical to the predominant C. jejuni sequence. Recombination breakpoints occurred within the amplified region in four introgressed alleles, of which alleles 19 and 111 corresponded to composite primer and probe sequences (Fig. 4B). Alleles 22 (n ϭ 11) and 93 (n ϭ 1) were not identified as introgressed sequences during the in  silico evaluation because the putative breakpoints occurred at the 5= end of the forward primer (Fig. 4B).
For both mapA and ceuE, between-species p distances were at least an order of magnitude greater than those within species, and lower levels of diversity were observed for the targeted species (Table 3). Analyses carried out at the allele level indicated that gene and protein diversity was primarily due to an abundance of rare alleles (see Fig. S1 in the supplemental material). The average dN/dS ratios for mapA and ceuE were Ͻ1 (0.077 to 0.14) in C. jejuni and C. coli, consistent with both genes being under stabilizing selection; however, the distribution of synonymous and nonsynonymous substitutions suggested species-specific differences in mapA and ceuE evolution (see Fig. S2 in the supplemental material).

DISCUSSION
This study demonstrates how bacterial WGS data can be used indirectly to support diagnostic laboratory activities. In silico analyses of primer and probe sequences, in conjunction with RT-PCR, confirmed that the mapA-ceuE assay was robust. Overall, 1,707 (99.7%) isolates were correctly identified, which was similar to the 97.7% level of accuracy reported during test validation by conventional approaches (19). Assay specificity was attributable to a combination of interspecies diversity and marked intraspecies conservation within primer-and probe-binding regions, in addi-tion to overrepresentation of sequences identical to published primers and probes (Fig. 2).
Experimental evidence suggests that the products of mapA and ceuE play roles in colonization (39) and iron acquisition (61), respectively, in Campylobacter species. Whole-gene analyses indicated that intraspecies diversity was low at the gene and protein levels and that both targets were under stabilizing selection (Table  3). One possible explanation for these findings is that mapA and ceuE encode essential cellular components in C. jejuni and C. coli. This is supported by the results of experimental studies in which mapA and ceuE mutants showed a reduced potential for chicken colonization (39,62). Species-specific differences in the distribution of synonymous and nonsynonymous substitutions in mapA and ceuE suggest divergent evolution in C. jejuni and C. coli postspeciation (see Fig. S2 in the supplemental material), perhaps because of host niche differences (35,54,(63)(64)(65). Interestingly, the predicted protein sequences of 11 C. jejuni-specific ceuE alleles (n ϭ 20) were truncated because of variation in three homopolymeric tracts, one of which occurred at the 5= end of the gene ( Table  1). Sequencing of NCTC 11168 demonstrated that homopolymeric tracts are common in the C. jejuni genome, with multiple variants detected in clones that were otherwise indistinguishable. These hypervariable sequences regulate gene expression through phase variation (66). It is possible that ceuE is also phase variable, although confirmation of homopolymeric tract lengths was beyond the scope of this study.
Although a small proportion of Campylobacter isolates could not be identified to the species level with the mapA-ceuE assay, both targets were universally present and no isolates were incorrectly identified. Only six (3.1%) C. coli isolates could not be identified, including two mapA-positive/ceuE-positive isolates and four isolates with inconclusive results due to late detection of mapA (n ϭ 1) or ceuE (n ϭ 3) (C T values, 32 to 37) ( Table 2; see  Table S3 in the supplemental material). Introgression by horizontal gene transfer (HGT) was the underlying cause of mapA detection among C. coli isolates (Fig. 4). HGT can result in the transfer of complete genes (whole-allele replacement) or the generation of mosaic alleles. While whole-allele replacements were relatively uncommon (1%), they accounted for the mapA-positive/ceuEpositive C. coli isolates. Mosaic alleles were more prevalent (7.8%) but resulted in only one inconclusive RT-PCR result. The apparent lack of introgression in ceuE may be due to functional and combinatorial epistasis, as the gene is part of the ceuBCDE operon,  (19). the products of which form an inner membrane ABC transporter system (41). Those isolates that could not be conclusively identified because of late detection of ceuE corresponded to clade 3 C. coli. While clade 1 C. coli strains account for the majority of human disease and agricultural isolates, strains belonging to clades 2 and 3 are generally from environmental sources (54). Phylogenetic analyses showed that clade 3 ceuE sequences were divergent from other C. coli-specific alleles (Fig. 3B). An accumulation of mutations in the forward primer region reduced the amplification efficiency ( Fig. 2B; see Table S3), indicating a lack of assay specificity for clade 3 C. coli. Taken together, the results of the in silico evaluation showed that the mapA-ceuE RT-PCR assay reliably identifies C. jejuni and C. coli, while whole-gene analyses provided insights into underlying reasons for the specificity of the assay. The value of these findings also extends to other diagnostic assays that use mapA or ceuE as a target (20,21,(67)(68)(69)(70). In the United Kingdom and other high-income countries, the impact of RT-PCR failures observed in this study would be limited because (i) C. coli accounts for a small proportion of human campylobacteriosis cases (12), (ii) the RT-PCR result was unaffected for the majority of isolates with introgressed mapA alleles, and (iii) clade 2 and 3 C. coli strains rarely cause human disease (54). Given that signals of host association are more marked than geographic signals (35), it is likely that the assay will perform well in other regions where food animals similar to those consumed in the United Kingdom are consumed; however, laboratories in regions with discernible differences in Campylobacter epidemiology should exercise caution and validate the assay prior to use.
The in silico approach to assay evaluation used here could be extended to other NAATs or molecular diagnostic tests. Compared to Sanger sequencing, WGS represents an attractive alternative for studying primer sequences, particularly those with mismatches that adversely affect amplification efficiency. The approach outlined in this study could be used to evaluate existing assays, or it could be applied in conjunction with primer design software during assay development, requiring only a personal computer with internet access and publicly available software.

FUNDING INFORMATION
This work, including the efforts of Melissa Jansen van Rensburg, was funded by Merton College Oxford. This work, including the efforts of Melissa Jansen van Rensburg, was funded by Clarendon Fund. This work, including the efforts of Melissa Jansen van Rensburg, was funded by Funds for Women Graduates. This work, including the efforts of Martin C.J. Maiden, was funded by Wellcome Trust (087622). This work, including the efforts of Melissa Jansen van Rensburg and Martin C.J. Maiden, was funded by DH | National Institute for Health Research (NIHR) (HPRU in Gastrointestinal Infections).