Comparative Genome Analysis and Global Phylogeny of the Toxin Variant Clostridium difficile PCR Ribotype 017 Reveals the Evolution of Two Independent Sublineages

ABSTRACT The diarrheal pathogen Clostridium difficile consists of at least six distinct evolutionary lineages. The RT017 lineage is anomalous, as strains only express toxin B, compared to strains from other lineages that produce toxins A and B and, occasionally, binary toxin. Historically, RT017 initially was reported in Asia but now has been reported worldwide. We used whole-genome sequencing and phylogenetic analysis to investigate the patterns of global spread and population structure of 277 RT017 isolates from animal and human origins from six continents, isolated between 1990 and 2013. We reveal two distinct evenly split sublineages (SL1 and SL2) of C. difficile RT017 that contain multiple independent clonal expansions. All 24 animal isolates were contained within SL1 along with human isolates, suggesting potential transmission between animals and humans. Genetic analyses revealed an overrepresentation of antibiotic resistance genes. Phylogeographic analyses show a North American origin for RT017, as has been found for the recently emerged epidemic RT027 lineage. Despite having only one toxin, RT017 strains have evolved in parallel from at least two independent sources and can readily transmit between continents.

Using whole-genome sequencing (WGS) and phylogenetic analysis, He et al. (4) identified the presence of two genetically distinct sublineages of RT027 through single-nucleotide polymorphism (SNP) analysis; both had emerged in North America within a relatively short period after acquiring the same fluoroquinolone resistanceconferring mutation containing an alteration in gyrA and a highly related conjugative transposon (4). The two epidemic sublineages showed distinct patterns of global spread, with one lineage spreading more widely and causing health care-associated outbreaks globally (4).
Traditionally, virulent C. difficile strains are characterized and identified in diagnostic laboratories by the presence of two potent toxins, TcdA and TcdB (30). These genes are located on a 19.6-kb pathogenicity locus (PaLoc). There is genetic variation in this region which can be exploited and which has revealed 30 different toxinotypes, including six A Ϫ B ϩ toxinotypes. The most common and clinically relevant is toxinotype VIII, and these isolates belong to RT017 (31). It is well known that the tcdA gene of this type contains a 1.8-kb deletion at the 3= end and a nonsense mutation at tcdA amino acid 47 that introduces a stop codon leading to a truncated tcdA gene (31). RT017 strains also lack the binary toxin (CDT) found in, for example, pathogenic RT027 strains that produce all three toxins. Despite lacking two toxins, clinically significant C. difficile infection (CDI) has been reported worldwide for the RT017 lineage (32)(33)(34)(35)(36)(37)(38)(39)(40)(41).
Here, WGS and phylogenetic analysis were used to define the population structure of a collection of 277 RT017 isolates from six continents of human and nonhuman origins with isolation dates between 1990 and 2013. Analyses reveal that RT017 strains have evolved in parallel from at least two independent sources and can readily transmit between continents. Genotypic and phenotypic antimicrobial susceptibilities were also compared.
We generated a maximum-likelihood phylogenetic tree based on the 1,288 SNPs that demonstrated the presence of two genetically diverse sublineages, SL1 and SL2 ( Fig. 1 and 2). Of the 1,288 SNPs, 76% (977/1,288) had a minor allele frequency (MAF) of Յ1% and/or were within 1 bp of an insertion or deletion. To control for false-positive identification of SNPs (these SNPs may mask the true phylogeny of RT017), phylogenetic trees with and without these SNPs were generated. The inclusion of 977 SNPs had a minor effect on the overall phylogenetic tree. Four SNPs were found to differentiate the two sublineages, one present in a noncoding region and three nonsynonymous SNPs (Table 3). SL2 is the most distantly related to the reference M68 strain of the two sublineages, and both sublineages are geographically and temporally widespread. All isolates from the previously reported study on London isolates fell into SL2 (39).
The RT017 strains are documented to have a higher level of antibiotic resistance than other C. difficile RTs (37,56). Fluoroquinolone resistance in C. difficile has been associated with mutations in codon 82 of the gyrA gene and codon 426 of the gyrB gene. The common SNP found in the gyrA gene is T82I, and those in the gyrB gene are A426V and A426A (57). Remarkably, we found 64.6% (179/277) to have the amino acid substitution found in the gyrA gene (T82I). A substitution in the gyrB gene (V426N) was present in 4.7% of strains (13/277), and an additional 10.1% (28/277), including M68, harbored a valine at position 426 of the predicted gyrB product ( Table 2; Information S1). The T82I substitution was globally distributed in both sublineages. Additionally, substitutions in the 81-bp rifampin resistance-determining region of the rpoB gene, R505K, H502N, and S485F, were found in 32.5% (90/277), 33.2% (92/277), and 1.1% (3/277) of isolates, respectively ( Table 2; Information S1).
To investigate horizontal gene transfer, a key mechanism driving C. difficile evolution, we performed programmatic and visual inspection of the comparisons, which revealed 56 regions of DNA between ϳ4 and ϳ61.5 kb that were absent from the M68 strain but present in other strains. These had 34 different insertion sites (Table 2 and Fig. 3; Information S1 and S4). Additionally, we found regions of DNA of between ϳ8 and ϳ29 kb present in the M68 strain at six sites but absent from multiple samples ( Table 2; Information S1 and S3). These insertions and deletions were associated with erythromycin, teicoplanin, tetracycline, chloramphenicol, and beta-lactam resistance genes, and their products potentially associated with virulence, such as a twocomponent response regulator, a SAM protein, an AntA/AntB antirepressor, a cell surface protein, and a sporulation-specific glycosylase (Information S3 and S4). The  deletions and insertions were well distributed geographically and temporally, and a 49-kb insertion found only in a clonal cluster of 23/37 London isolates in our previous study (39) was also found to insert at a different site in single isolates from Canada, the United States, and the United Kingdom, with isolation dates of 2006, 2006, and 2011, respectively (Fig. 3). Only one SNP was found in the toxin pathogenicity locus region, which was synonymous and present in the nonfunctioning tcdA gene fragment from five South Korean isolates in SL2 isolated between 2004 and 2008. Visual inspection of the comparisons revealed both tcdA and tcdB genes to be highly conserved; no sequence variations were found.
MICs were determined for eight C. difficile isolates (including M68 as a control) against the antibiotics chloramphenicol, rifampin, tetracycline, erythromycin, nalidixic acid, gentamicin, teicoplanin, and ampicillin. Their MICs are shown in Table 4. All isolates were resistant to nalidixic acid, gentamicin, and ampicillin, were either resistant or intermediately resistant to tetracycline, and were sensitive to teicoplanin. Out of eight isolates, two were resistant to chloramphenicol, four were resistant to rifampin, and seven were resistant to erythromycin.

DISCUSSION
The RT017 lineage, with its unique toxin profile and unusual global prevalence, has been overshadowed by the global outbreak of the RT027 lineage. Reminiscent of the RT027 lineage, two distinct sublineages of C. difficile RT017 that contain multiple FIG 1 Maximum-likelihood phylogenetic analysis of 277 global RT017 isolates based on core genome SNPs against the M68 reference. We used non-rare (Ͼ1% MAF) SNPs that were not in close proximity to insertions or deletions to determine the phylogenic tree. The SL1 and SL2 sublineages were differentiated by four SNPs (Table 3), with the reference strain M68 falling into SL2. The colored nodes indicate the geographical source of isolates. independent clonal expansions were revealed in this study. This division demonstrates that toxin variant strains emerged on at least one occasion, suggesting that a full toxin repertoire is not essential for efficient human-to-human transmission.
Based on our gyrA and gyrB SNP data, we predict up to 76.2% (211/277) of isolates are resistant to the fluoroquinolone class of antibiotics. Interestingly, the T82I SNP found in gyrA is the same mutation reported in the global outbreak of RT027 (4). Based on our MIC data, all eight isolates were resistant to nalidixic acid, indicating resistance to the fluoroquinolone class of antimicrobials.
Based on our rifampin SNP data, we predict 34.7% (96/277) of isolates in this study are resistant to the rifampin class of antibiotics. Interestingly, 82% (152/185) of these substitutions were found in SL1. R505K and H502N have previously been associated with rifampin resistance in C. difficile (60); however, based on our MIC data, only two (2/8) isolates were sensitive to rifampin, with one of the isolates containing the R505K and H502N SNP, indicating that these alone do not always lead to phenotypic resistance. Interestingly, S485F was found in three historical isolates from Wrexham, United Kingdom. This resistance-conferring SNP previously has been reported only in Myco- bacterium tuberculosis and not in C. difficile (61). All three isolates were phenotypically resistant to rifampin; however, all three isolates also contained the R505K SNP, confirming this SNP's contribution to resistance was not possible. The multiple haplotypes revealed is similar to those found for the RT027 global    study, where Ͼ100 distinct genotypes were found in 151 isolates. Despite SNPs and insertions and deletions, there was no variation in susceptibility to ampicillin, teicoplanin, gentamicin, or nalidixic acid. However, there was some variation with chloramphenicol, rifampin, tetracycline, and erythromycin. Whether the insertions carrying chloramphenicol o-acetyltransferase, the TetR-family transcriptional regulator, or the ermB gene played a role in this variation is unknown. Figure 4 depicts the phylogeny of the isolates by source. Interestingly, the 24 animal strains, which were all isolated from a similar location (Ontario, Canada) over a relatively short time period (2002 and 2005), are distributed among human isolates in SL1 only. This suggests there is transmission between humans and animals.
The ready global distribution of RT017 suggests determinants independent of toxin B are important in transmission. This could be related to the ready acquisition of antibiotic resistance determinants, efficient germination, and/or spore formation. This study provides the basis to further investigate factors important for the epidemic spread of C. difficile.
The deletions and insertions were well distributed geographically and temporally, suggesting either the rapid dissemination of strains or the multiple independent acquisitions and loss of DNA regions ( Fig. 2; Information S1). The insertion of different clusters of genes at the same site suggests hot-spot regions for the uptake of DNA (Information S4), and a 49-kb insertion found only in a clonal cluster of 23/37 London isolates in our previous study (39) was also found to insert at a different site in single isolates from Canada, the United States, and the United Kingdom, with isolation dates of 2006, 2006, and 2011, respectively (Fig. 3). This suggests these isolates have independently acquired this insertion. Similar to RT027, our analyses support a North American origin for RT017 with multiple, global transmission events, with its earliest movement into Europe in 1986 ( Fig. 4 and 5). The North American health system and practices appears to facilitate the ready evolution and epidemic spread of C. difficile for RT027 (4) and now, in this study, for RT017. Our data show that it was Europe that introduced RT017 to Asia and Australia, with subsequent spread from Asia to the Middle East, South America, and South Africa. The analysis indicates over 40 movements back and forth over the span of 30 years, consistent with population movements of a globalized society. Traditionally, it has been considered that RT017 strains emerged from Asia due to the reported high incidence of this RT that could not relate to or depend on toxin A-based assays for diagnosis (40). However, our analysis does not support an "out-of-Asia" hypothesis and supports a North American origin ( Fig. 4 and 5).
This study investigated the genetic diversity of 277 C. difficile RT017 isolates with temporal, geographical, and source variation. Phylogeographic analysis of the SNPs identified through WGS of the isolates suggests that there are two main sublineages of RT017 that share an ancestry and are globally disseminated. Both sublineages contain isolates from diverse geographical locations and isolation dates, with animal isolates spread among human isolates in SL1. Together with the haplotype diversity and geographically and temporally diverse presence of the transposable elements, these data suggest widespread transcontinental spread and recombination with independent acquisition and loss within different clusters.

MATERIALS AND METHODS
The 277 isolates described in this study are shown in Table 1 and included 37 isolates from a previous study (ENA study accession number ERP009770) (39), with the remaining being new to this study (ENA study accession number PRJEB11868). These were of human (n ϭ 251), environmental/hospital ward (n ϭ 2), equine (n ϭ 4), canine (n ϭ 11), and bovine (n ϭ 9) origin, with isolation dates between 1990 and 2013. These isolates were subjected to genomic DNA extraction as previously described by Stabler et al. (62). WGS data for the isolates was obtained using either the HiSeq 2000 sequencing system or the MiSeq sequencing system (Illumina, California), and libraries were created as previously described (63) or using a Nextera XT kit (Illumina, California), respectively. The sequence data were processed and quality controlled according to a standard pipeline as previously described (64). Briefly, FASTQ-formatted sequencing reads were quality controlled with a minimum quality Phred score of 30 (as a rolling average over 4 bases) using trimmomatic (65). The resulting reads were mapped using the BWA-MEM (66) software against the M68 C. difficile reference strain, and the majority of posttrimmed reads (Ͼ92% for all samples passing quality control) were mapped to the reference. SNPs were called using SAMtools/ VCFtools (67).
Velvet (68) and Velvet Optimizer (http://bioinformatics.net.au/software.velvetoptimiser.shtml) were used to de novo assemble the trimmed reads into contigs, producing 277 assemblies. Optimal k-mers fell between 53 bp and 97 bp, and the mean value for median contig size of genome assembly (n50) was over 928,000 bp. The mean longest contig was 1,067,000 bp, with 71 samples producing contigs that covered over half of the genome (greater than ϳ2.15 Mbp), and 16 samples assembled to contigs greater than 4 Mbp (equivalent to greater than 90% of the genome). Pipeline, post-, genetic, phylogenetic, phylogeographic, and cluster analyses were carried out using Perl, R, abacas, prokka, RaXML, Bayesian evolutionary analysis sampling trees (BEAST), and mclust software (69)(70)(71)(72)(73). A minor allele frequency (MAF) of less than 1% was used. To remove any SNPs that may be associated with recombination and which would mask the true phylogeny, SNPs within 1 bp of an insertion or deletion site were excluded from further analysis. We used BEAST (72) to produce an SNP phylogeny from the SNPs, as well as geographical and temporal data combined in phylogeographic analysis and mclust software for maximum likelihood cluster analysis.
To determine the MICs of 7/277 isolates, dilutions for the antibiotics chloramphenicol, rifampin, tetracycline, erythromycin, nalidixic acid, gentamicin, teicoplanin, and ampicillin were made as previously described (74). Briefly, 10 ml preequilibrated brain heart infusion broth, supplemented with yeast (Oxoid), L-cysteine (Sigma), and C. difficile supplement (Oxoid) (BHIS), were inoculated with three colonies of 48-h culture on BHIS agar plates. Once the optical density (OD) reached 0.3 nm, 24-well plates containing the antibiotic dilutions were inoculated with 1/100 of the BHIS broths and incubated. The ODs were measured 24 h postinoculation, and MIC data were categorized as susceptible, intermediate, and resistant by following the Clinical and Laboratory Standards Institute (CLSI) and the European Committee on Antimicrobial Susceptibility Testing (EUCAST) guidelines. The reference strain M68 was used as a control, as were appropriate negative controls.
Accession number(s). Sequence data were deposited in the European Nucleotide Archive under study accession number PRJEB11868.

ACKNOWLEDGMENTS
We acknowledge the Public Health Laboratory, London, United Kingdom, for help with PCR ribotyping. We thank David Harris at the WTSI for assistance with DNA sequencing.
The work was supported by the National Institute for Health Research (NIHR), the Wellcome Trust, and the Medical Research Council (grant reference MR/K000551/1). M.C. is funded by a doctoral research fellowship award from the NIHR. This report is independent research arising from a Chief Scientific Officer (CSO) Healthcare Scientist Award supported by the National Institute for Health Research and the CSO.
The views expressed in this publication are those of the author(s) and not necessarily those of the National Health Service, the NIHR, or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We have no conflicts of interest to report.