Hash-Based Core Genome Multilocus Sequence Typing for Clostridium difficile

Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging. Here, we describe a refinement to core genome multilocus sequence typing (cgMLST) in which alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST).

ing single nucleotide polymorphisms (SNPs) identified following mapping to a reference genome offers high precision (1), but despite efforts to optimize computational approaches (2) it is relatively slow. In contrast, k-mer-based approaches based on hash algorithms, e.g., MASH (3) and PopPUNK (4), are fast, but the inherent and unstructured dimensionality reduction (e.g., summarizing the whole genome as 500 hash strings selected on the basis of sorted hash strings) can reduce precision in fine-scale transmission analyses. Core genome multilocus sequencing typing (cgMLST) (5) potentially provides a solution; genomes are summarized as a list of ϳ2,000 to 3,000 numbers, with each number representing the unique sequence of each core gene, i.e., structured dimensionality reduction. This summary enables more rapid comparisons, as, taking the example of Clostridium difficile, only 2,270 gene allele numbers need be compared (6), rather than having to compare 4.3 million base pairs of sequence data for SNPs. A drawback of cgMLST as described to date is that it requires a centralized database of alleles of each gene to be maintained so cgMLST profiles generated by different laboratories are comparable. This centralized support can potentially be provided by academic, public health, or commercial organizations, but any given scheme's sustainability is potentially limited by the funding available to support it. Additionally, for some pathogens, including C. difficile, several competing cgMLST/whole-genome MLST schemes (e.g., Enterobase [University of Warwick, UK], the cgMST.org Nomenclature Server [Ridom GmbH, Germany], and BioNumerics [bioMérieux, France]) containing different genes and profiles have been developed; the latter two are associated with commercial platforms for processing sequencing data.
We therefore propose an alternative to cgMLST as described to date. Instead of maintaining a database of alleles, each allele is reproducibly converted to a unique hash, or short string of letters. This compresses each item of identical data to the same smaller representation, based on the sequence of an allele alone. Therefore, this process can be undertaken independently in different laboratories without the need to maintain or subscribe to a central database, but it still generates summary data in a reproducible form that can be exchanged by laboratories. This distributed approach avoids the potentially costly need to maintain a central database.
This study has two main aims. The first is to demonstrate an implementation of hash-based cgMLST and to test whether hash-cgMLST profiles can be compared without a significant performance penalty compared to standard cgMLST; the second is to test the reproducibility and discriminatory power of cgMLST compared to SNPbased typing. The discriminatory power of cgMLST has been previously explored (for examples, see references 6-9); however, how cgMLST gene differences relate to SNP distances has not been comprehensively assessed. Instead, it is postulated that small numbers of SNPs are likely to fall in different genes, and so SNP distances and gene differences are likely to be similar for closely related isolates. We evaluate the extent to which this assumption holds. Related to this, only limited assessments of the reproducibility of cgMLST have been undertaken. The largest study to date involved the same Staphylococcus aureus DNA from 20 isolates undergoing sequencing in 5 laboratories (10). In this setting, in 80 comparisons (i.e., 20 sequences from 4 laboratories compared with the baseline laboratory), only 3 false gene differences were identified. We investigate whether these results can be replicated in C. difficile.

MATERIALS AND METHODS
Hash-cgMLST. Using the cgMLST scheme of Bletz et al. (6), the first allele for each of the 2,270 genes was used to create a BLAST search query. Following previous descriptions (6,10), BLAST searches for each gene required a 90% identity match, a matched length Ն99% of the query length, and the matched gene to be free from ambiguous characters or premature truncation. To avoid apparent truncated genes arising from misassembly, we checked the number of stop codons in the gene sequence and only retained matches with a single stop codon. To avoid truncation arising from contig breaks, we ensured that BLAST matches included the start and end of the query sequence. Other BLAST search parameters were as follows: "evalueϭ0.01, word_sizeϭ11, penaltyϭ-1, rewardϭ1, gapopenϭ5, gapextendϭ2." The resulting genes were either matched to the database available at cgMLST.org, i.e., standard cgMLST, or hashed using an md5 algorithm to create a 32-character hexadecimal string. Deletions relative to the search query, represented by dashes in the matched gene sequence, were removed prior to generating the hash. This avoids false differences introduced by locally variable placement of these deletions introduced by BLAST. The resulting cgMLST and hash-cgMLST profiles were saved as JavaScript Object Notation (JSON) files, i.e., a format that could readily be exchanged between laboratories. Where no BLAST match was found for a gene in the scheme, an empty value was recorded and that gene excluded in pairwise comparisons.
The choice of md5 hash provides 16 32 , i.e., 3.4 ϫ 10 38 , possible hashes. There is a theoretical chance of hash collisions, i.e., different sequences resulting in the same hash, but as the number of viable sequences for each gene in cgMLST databases is typically only tens to hundreds, this is very unlikely. Importantly, if a hash collision occurred, this would result in genomes appearing falsely more similar, rather than in falsely excluding potential transmission.
Sequence data. During whole-genome sequencing of C. difficile undertaken in Oxford and Leeds (UK), we have routinely resequenced a subset of isolates as part of our internal quality assurance. We searched our database for isolates sequenced more than once. For a subset of these replicate sequences, the same extracted DNA was used to generate both sequences; for the remainder, it was not documented in our laboratory information management system whether the same DNA extract was resequenced or whether a fresh DNA extract was made from the same frozen isolate (Table S1). Paired-end sequence data for both types of replicate were generated using Illumina technology, including various iterations of the HiSeq and MiSeq platforms, with read lengths ranging from 100 to 150 bp for the majority of the sequences (two 50-bp sequences were also included).
To compare the discriminatory power of hash-cgMLST compared to SNP-based typing, we processed 973 genomes from a previously published study of consecutive C. difficile infections over 1 year in six English hospitals using our hash-cgMLST and SNP pipelines (11).
Bioinformatic processing. For hash-cgMLST typing, raw sequence data underwent adapter trimming and quality trimming using bbduk.sh from the BBMap package (version 38.32) (12). Stringent quality trimming was applied following Mellmann et al. (10); both the left and right ends of each read were trimmed to a Q30 threshold (using BBDuk parameters "ktrimϭr kϭ23 minkϭ11 hdistϭ1 tpe tbo qtrimϭrl trimqϭ30"). Following this, the number of bases remaining in the trimmed reads was divided by the length of the 630 reference genome (13), 4,290,252 bp, to provide the mean high-quality coverage; this was required to be Ն50ϫ for a sequence to be included in the study. Appropriate quality trimming and adapter removal were confirmed using FastQC (14). To check for contamination with non-C. difficile DNA, the species origin of sequence reads was classified using Kraken2 (15) and the MiniKraken2_v1 database (built from the RefSeq bacteria, archaea, and viral libraries).
Following Bletz et al. (6), reads were de novo assembled using SPAdes (version 3.11.1) (16), with the "-careful" flag to reduce misassembly, using Burrows-Wheeler Aligner (BWA)-based mapping to confirm variants. Assembly quality metrics were obtained using the stats.sh script from BBMap (12). Samples with assembly sizes (base pairs in contigs) of Ͼ10% more or less than the median size were rejected. We also tested performance using SPAdes with an additional flag, "-only-assembler," to disable the SPAdes internal read correction procedure. As an additional comparison, reads were also de novo assembled using SKESA (version 2.3) (17) with default settings.
Reads (without stringent quality trimming) were also mapped to the 630 reference genome as described previously (1,11,18), using stampy (19) for mapping and mpileup (20) for variant calling, followed by quality filtering of variants. Variant calls were required to have a quality score of Ն30, be homozygous under a diploid model, be supported by Ն5 high quality reads (including Ն1 read in each direction and a consensus of Ն90% of bases), and not be within a repetitive region of the genome. See https://github.com/oxfordmmm/CompassCompact for example implementation. For inclusion, Ն70% of the reference genome needed to be called in the consensus sequence. Bases in the consensus sequence not passing quality filtering were denoted N rather than A, C, G, or T.
The bioinformatic pipelines used in this study for assembly and hash-cgMLST were written as Nextflow workflows (21) and can be found at https://github.com/davideyre/hash-cgmlst. Information on required dependencies and system requirements is provided in the repository readme file.
Analysis. Sequences meeting all quality thresholds (high-quality average coverage, assembly size, and proportion of reference genome called) were compared. For replicate sequences, when an isolate had been sequenced more than twice, a random sequence was chosen as the baseline sequence to which all other sequences from the same isolate were compared in order to avoid multiple counting.
Pairwise observed SNP differences between replicates and recombination-corrected SNP differences between other C. difficile genomes were obtained using Python scripts, PhyML (22), and ClonalFrameML (23), as previously described (11) (https://github.com/davideyre/runListCompare). Whole-genome alignments were used as input for PhyML. Invariant sites, i.e., those called as the same base as the reference or as an unknown base (N) across all genomes were set to be the same base as the reference for computational efficiency, given that there was no evidence of variation at these sites. All other sites had evidence of variation in at least one genome and were included unchanged, including any genomes with an N at that site. The maximum likelihood approach taken accounts for the uncertainty in the phylogeny arising from some genomes having an N called at some variable sites.
The number of cgMLST locus differences and number of loci compared were obtained using Python (https://github.com/davideyre/hash-cgmlst). Where no BLAST match was found for a gene in either (or both) of the genomes in a pairwise comparison, this was not counted toward the total number of cgMLST gene differences.
Data availability. Sequence Read Archive (SRA) accession numbers for analyzed replicate genomes are provided in Table S1, with explanatory notes in the accompanying legend. Data for the 973 genomes from six English hospitals can be found under NCBI BioProject accession number PRJNA369188.

RESULTS
Hash-cgMLST provided the same results as standard cgMLST with a minimal performance penalty. Results are presented throughout using pairwise core-gene differences generated with hash-cgMLST, as these were identical to standard cgMLST gene differences if novel alleles were accounted for.
Comparison of hash-cgMLST and SNP typing performance in replicate sequences. A total of 374 sequences from 104 isolates passed all quality checks and were available for comparison to investigate the reproducibility of sequencing followed by cgMLST for C. difficile transmission analyses. A median of 2 (interquartile range [IQR], 2 to 3; range, 2 to 27) sequences were available per isolate. Comparing replicate sequences with a randomly selected baseline sequence for each isolate yielded 272 comparisons for analysis.
With perfect sequencing, no variants would be expected between pairs of sequences from the same isolate (replicate pairs). Using reference-based mapping and variant calling, there were 0 SNPs between 262 (96%) replicate pairs, 1 SNP between 5 (2%) pairs, and 2 SNPs between 1 (Ͻ1%) pair, i.e., a mean of 0.026 SNPs per pair, which equates to 1 false SNP call per 39 sequences (Fig. 1A). Based on the rate of C. difficile evolution and the extent of within-host genetic diversity, Յ2 SNPs are expected between Ͼ95% of cases related by recent transmission (1); it is therefore unlikely that transmission would be falsely excluded on the basis of the error rates seen.
Predictors of false cgMLST gene differences. The observation of greater differences between replicates when restricted to variation in the 2,270 core genes versus considering SNPs across the whole genome is potentially counterintuitive. However, it should be remembered that the whole-genome SNP approach depends on a different bioinformatic approach with sophisticated per-variant quality filtering, whereas cgMLST is based on de novo assembly with more limited quality filtering. We therefore investigated potential predictors of false cgMLST gene differences using the hash-cgMLST algorithm (potential predictors were identical to those in the standard cgMLST approach) to see if filtering could be improved. Although we had already restricted our analysis to only include sequences with a mean genome coverage of Ͼ50ϫ, we investigated whether a more stringent threshold would improve performance (Fig. 2). There was no evidence that increased coverage was associated with fewer cgMLST gene differences (Spearman's rho ϭ Ϫ0.04; P ϭ 0.43). There were only 2 sequences in the data set with 50-bp reads; the remainder had 100-or 150-bp reads. In total, 14/222 (6%) sequence pairs in which the minimum sequence length was 100 bp contained Ͼ2 gene differences, compared to 4/48 (8%) in pairs with both 150-bp reads (exact P ϭ 0.54). The relationship between cgMLST gene differences and de novo assembly quality metrics is shown in Fig. 3A to C. Given the filtering applied, there was still an association between the number of false gene differences and the maximum absolute percentage deviation from the overall median assembly size (4,165,590 bp) within each replicate pair (which was constrained to be Յ10% for inclusion in the analysis) (Spearman's rho ϭ 0.21; P Ͻ 0.001; Fig. 3A, with both small and large assemblies contributing to this effect). The L 50 value describes the minimum number of contigs required to achieve 50% of the assembly size, with higher values representing more fragmented lower quality assemblies. Higher L 50 values were associated with greater rates of false gene differences (Spearman's rho ϭ 0.37; P Ͻ 0.001). A total of 9 (2%) of 257 pairs with both L 50 values of Յ125 had Ͼ2 false gene differences, compared to 9/15 (60%) with one or more sequences with an L 50 value of Ͼ125 (Fig. 3B). Another measure of assembly fragmentation is the total number of contigs; higher numbers of contigs were also associated with greater false gene differences (Spearman's rho ϭ 0.31; P Ͻ 0.001; Fig.  3C). Figure 3D shows the impact of the proportion of reads classified as C. difficile by Kraken2 on cgMLST gene differences. Within the data set, there was no evidence of significant contamination with a bacterial species other than C. difficile, and the most common species in all samples was C. difficile. However, the proportion of reads that could not be classified at all varied from 0 to 11% between sequences, with the FIG 2 Relationship between hash-cgMLST gene differences in replicate sequence pairs and average genome coverage and read length. Jitter applied to points to assist visualization. SPAdes with the "-careful" flag was used to generate assemblies. exception of one replicate pair (36% and 24%). Higher rates of unclassified sequences were associated with higher false gene differences but without any clear separation of the data on this basis (Spearman's rho ϭ Ϫ0.23; P Ͻ 0.001).
Distribution of cgMLST gene differences in replicate sequences. The gene differences observed between replicate sequences disproportionately affected a small number of genes (Table S2). Only 82 (4%) of 2,270 genes contained differences within the replicate sequences. To avoid multiple counting, we evaluated the number of isolates that contained at least a pair of replicates with gene differences. A total of 16 genes contained differences in two or more isolates' replicates and, of these, 15 were due to the same nucleotide differing in all replicate pairs. The reproducible location of the differences observed for a given gene across different isolates is compatible with consistent misassembly (Table S2). If the 15 genes with identical gene differences affecting Ն2 isolates were excluded, the number out of the 272 replicate pairs with 0 gene differences increased from 218 (80%) to 236 (87%), and the number of pairs with Ͼ2 gene differences reduced from 18 (7%) to 14 (5%). (Fig. S1B). Using the full 2,270-gene set and disabling SPAdes internal read correction resulted in fewer false FIG 3 Relationship between hash-cgMLST gene differences in replicate sequence pairs and de novo assembly quality metrics (A to C) and Kraken2 read classification (D). Jitter applied to points to assist visualization. One point is omitted from Fig. 3D for ease of visualization with the proportion of reads classified as C. difficile (0.64) and 0 gene differences. SPAdes with the "-careful" flag was used to generate assemblies. gene differences, namely, 0 differences in 236 (87%) pairs and Ͼ2 differences in 14 (5%) (Fig. S1C).
Benchmarking. Samples were processed in parallel, with each sample using a single core from an Intel Xeon Gold 6150 2.70-GHz 18-core central processing unit (CPU). For a single sample, the median (IQR) time to undertake quality control and read filtering was 3.6 (2.7 to 4.9) minutes and 27.4 (19.6 to 35.4) minutes, respectively, to generate an assembly using Spades with read error correction and 16.3 (12.1 to 21.5) minutes without; SKESA took 19.4 (15.5 to 24.3) minutes. Creating a hash-cgMLST profile from the assemblies took 44.1 (43.5 to 44.9) seconds. After making hash-cgMLST profile files, comparing a single genome to 100,000 others using a single CPU core took 40.4 s. In contrast, 100,000 comparisons using a standard cgMLST approach took marginally less time-38.7 s-after loading the profiles into memory.
cgMLST profiles can also be rapidly compared using a laptop or desktop; for example, using one core of an Intel i7 2.6-Ghz laptop processor, comparing the 973 samples from the six hospitals study required 467 Mb of memory and took 236 s for 472,879 comparisons, i.e., 49.9 s per 100,000 comparisons. Using the same laptop, creating hash-cgMLST profiles from existing assemblies typically took ϳ40 s and required Ͻ100 Mb of memory.
Comparison of hash-cgMLST and SNP typing in data from six English hospitals. We analyzed 973 genomes from a previous study of C. difficile transmission in six English hospitals (11). Of these, 56 failed the assembly size threshold and 20 failed the coverage threshold (one of these also failed the assembly threshold), leaving 898 (92%) genomes for analysis. We considered all pairs of genomes within Յ2 SNPs and used SPAdes (with the -only-assembler flag) or SKESA assemblies to test the extent to which the numbers of hash-cgMLST gene differences followed the number of SNPs ( Fig. 4A and C). Of 412 pairs of sequences differing by Յ2 SNPs, according to analysis using SPAdes assemblies, 376 (91%) were within Յ2 gene differences, 30 (7%) had 3 differences, and 16 (4%) had Ն4 differences; according to analysis using SKESA assemblies, 406 (99%) had Յ2 gene differences, and the remainder all had Յ5 differences. The median number of genes called in each pair was 2,143 (IQR, 2,084 to 2,191) using SPAdes and 2,003 (IQR, 1,891 to 2,110) using SKESA.
We also considered the distribution of recombination-corrected SNPs within pairs of genomes with Յ2 gene differences using hash-cgMLST. Following assembly with SPAdes, of 590 pairs of genomes, 376 (64%) were within Յ2 SNPs, with the maximum number of SNPs observed being 20 (Fig. 4B). Using SKESA analysis of 749 genome pairs, 406 (54%) were within Յ2 SNPs (Fig. 4D).

DISCUSSION
Here, we present the concept of hash-cgMLST as a tool for rapid comparison of bacterial sequencing data. This is a significant development from standard cgMLST approaches, as it removes the need for a central database of alleles. Such databases require resource-intensive curation to ensure they are maintained to a high standard. Additionally, allele numbering is currently done consecutively in a single location, which is problematic with large data sets that span many laboratories; hashes also overcome this limitation. We also provide the code to run the algorithms developed.
This article also highlights important limitations of common implementations of cgMLST as a tool for high-resolution outbreak detection. Stringent filtering done on the basis of mapped data allows the number of false variant calls to be controlled; here, we obtained around 1 false SNP for every 39 genomes sequenced. In contrast, fine-grained per-base quality control is typically not implemented in studies using de novo assembly tools. Using SPAdes, we observed a mean of 0.64 false gene differences per replicate genome pair. The alternative assembler tested, SKESA, was able to better control false gene differences, with 0.16 per replicate pair, i.e., 1 error per every 6.3 genomes sequenced. The higher rates of false variation observed using cgMLST/hash-cgMLST led to the counterintuitive observation in some samples of more differences when comparing 2,270 genes than when comparing the whole genome. It should be noted that undertaking SNP-based analyses from alignments of de novo assemblies without further filtering of variants would be similarly affected. These errors can be reduced by ensuring that the assemblies studied are of high quality. Our data suggest that the previously described read quality trimming and filtering based on assembly sizes (6, 10) could be further improved by also only analyzing samples with an L 50 value of less than ϳ125. However, this stringent filtering would have resulted in 30% of the previously published data set studied being unavailable for analysis, calling its practicability into question.
Although our approach does not depend on a database of alleles, it is dependent on the development of a high-quality cgMLST scheme, i.e., appropriate identification of core genes based on a large and diverse collection of genomes and careful selection of problematic genes for exclusion. Despite such an approach being taken in developing the C. difficile cgMLST scheme used, we show that removing a small number of genes from this cgMLST scheme would likely improve performance if using SPAdes assemblies, as a small subset of genes contained higher numbers of false gene differences (Table S2, Fig. S1). This highlights the importance of assessing the performance of each cgMLST scheme created on a per-species and per-scheme basis using appropriate test data sets that include replicate and closely related sequences.
Many of the apparent errors seen in replicate pairs appear to arise from misassembly. SPAdes-based read correction did not improve accuracy and instead resulted in more, rather than fewer, differences between replicate pairs. Use of an alternative assembler, SKESA (17), reduced the number of replicate pairs with Ͼ2 differences to just 1%, with a minimal reduction in the number of genes compared between replicate pairs (median of 2,225, compared to 2,227 with SPAdes). The reduction in genes compared was greater in the clinical data set analyzed (medians of 2,143 and 2,003), but this reduced discriminatory power for transmission studies will usually be more than offset by reduced error rates (and therefore reductions in erroneous exclusion of transmission).
Our data also highlight that extrapolating the Յ2-SNP threshold for identifying genetically plausible transmission events to two (or three [6]) gene differences may be inappropriate, depending on the choice of assembler and settings. Using SPAdes, 4% of pairs of samples within Յ2 SNPs were Ͼ3 genes different by cgMLST, whereas with SKESA this was only 1%. For public health applications optimized to identify potential transmission, to be Ն99% sure of not missing pairs of sequences within Յ2 SNPs, a threshold of Յ9 gene differences was needed for SPAdes assemblies and Յ3 differences with SKESA. However, these thresholds for SPAdes resulted in around 8 genome pairs that were Ͼ2 recombination-corrected SNPs apart being identified for every 1 pair within Յ2 SNPs (PPV, 11%) and 1.6 pairs that were Ͼ2 SNPs apart for every pair within Յ2 SNPs using SKESA (PPV, 38%). In this scenario, further SNP-based analysis based on mapping and filtered variant calling is likely to be required to determine which genomes are potentially related by recent transmission and which are not. In other cases, larger numbers of SNPs than gene differences were observed ( Fig. 4B and D), which may arise from SNPs outside core genes, SNPs in uncalled genes, and imperfect correction of recombination events.
Hash-cgMLST allowed rapid comparison of many thousands of bacterial genomes within seconds, using a relatively unoptimized Python script running on a single laptop or server CPU core. As comparisons with other genomes can be easily divided into independent parts, this task is readily parallelizable. Using hash-cgMLST, it is therefore potentially possible to compare each new sequence generated with millions of previous sequences. The summaries of each genome produced a roughly 130-kb JSON file, which is readily exchangeable between laboratories and could potentially be hosted alongside raw reads in sequence read archives. As such, each laboratory could maintain its own database of hash-cgMLST profiles and distances, as well as this information potentially being usefully provided as part of future web-based services based on publicly available data. Although, without further refinements, hash-cgMLST may not allow high-precision fine-scaled transmission studies, it has the potential to dramatically reduce the search space for closely related genomes, which can then be followed by more precise SNP-based analyses on a much smaller subset of genomes.
Using SPAdes, we observed a higher rate of "false" gene differences between genomes in which the sequences were potentially generated from separate DNA extractions of the same isolates compared to that in genomes obtained from the same DNA extraction. It is therefore plausible that the differences observed represent true differences but also a form of variation that is much faster and more erratic than mutation/recombination rates based on filtered SNPs. The erratic nature of the variation observed is unlikely to be informative about recent transmission. We also did not see these differences to the same extent using an alternative assembler, SKESA.
This study is potentially limited by not being an exhaustive investigation of all the potential options for assembly and for filtering de novo assembly data; in particular, further filtering of variants based on mapping reads back to assemblies, e.g., as done by Enterobase (24), may improve precision. Although we used Kraken2 to search for contamination with DNA from other species, contamination with C. difficile DNA from other concurrently processed samples may be an important contributor to some of the differences seen with hash-cgMLST, whereas resulting mixed calls can be filtered using mapped data.
In conclusion, appropriately quality controlled cgMLST can identify clusters of related genomes rapidly and is an appropriate tool for surveillance and reducing the search space in outbreaks. The SKESA assembler, compared to SPAdes, was associated with lower rates of gene differences between replicate sequences and, when used for hash-cgMLST, more closely matched the number of SNPs between closely related samples. The approach we describe has potential to be deployed across a range of pathogens, including those where linkage across time and wide geographic space, i.e., cases involving very large sequencing data sets, may help resolve sources and routes of transmission, such as for foodborne infections. Refined variant calling based on mapping is likely required to precisely define close genetic relationships. This study highlights the need for detailed quality assurance to determine the performance of algorithms used for comparing genomes. Our hash-cgMLST implementation is freely available and provides an effective database-free approach to cgMLST.