ABSTRACT
Enterotoxigenic Escherichia coli (ETEC) is a common cause of diarrhea among children living in and among travelers visiting developing countries. Human ETEC strains represent an epidemiologically and phenotypically diverse group of pathogens, and there is a need to identify natural groupings of these organisms that may help to explain this diversity. Here, we sought to identify most of the important human ETEC lineages that exist in the E. coli population, because strains that originate from the same lineage may also have inherited many of the same epidemiological and phenotypic traits. We performed multilocus sequence typing (MLST) on 1,019 ETEC isolates obtained from humans in different countries and analyzed the data against a backdrop of MLST data from 1,250 non-ETEC E. coli and eight ETEC isolates from pigs. A total of 42 different lineages were identified, 15 of which, representing 792 (78%) of the strains, were estimated to have emerged >900 years ago. Twenty of the lineages were represented in more than one country. There was evidence of extensive exchange of enterotoxin and colonization factor genes between different lineages. Human and porcine ETEC have probably emerged from the same ancestral ETEC lineage on at least three occasions. Our findings suggest that most ETEC strains circulating in the human population today originate from well-established, globally widespread ETEC lineages. Some of the more important lineages identified here may represent a smaller and more manageable target for the ongoing efforts to develop effective ETEC vaccines.
Enterotoxigenic Escherichia coli (ETEC) infections are an important cause of childhood diarrhea and diarrheal deaths among young children in developing countries (59) and of diarrhea among travelers to these countries (6, 54). Human ETEC strains are E. coli that produce one or more of three plasmid-encoded protein enterotoxins called human heat-stable toxin (STh or STaII), porcine heat-stable toxin (STp or STaI), and heat-labile toxin (LT or LT-I). The enterotoxins induce secretion of salts and water into the intestinal lumen (29). Many ETEC strains also produce surface appendages, called colonization factors (CFs), which help anchor the bacteria to the small intestinal wall (20). The toxins and all but one known CF are plasmid encoded (20, 34).
Human ETEC strains are phenotypically and epidemiologically diverse: more than 20 different CFs have thus far been described (20), the characterization of ETEC strains collected from different parts of the world has yielded 117 different serotypes (57), and some ETEC strains appear to be more pathogenic than others (9, 36, 46). This diversity poses a challenge for the ongoing efforts to develop effective ETEC vaccines (7). Many studies have shown that ETEC have emerged from E. coli on several occasions, probably through horizontal transfer of the enterotoxin-encoding virulence plasmids, and that some of these ETEC lineages appear to be widespread (4, 11, 31-33, 37, 38, 43, 47, 51). Because strains that originate from the same ETEC lineage may also have inherited many of the same epidemiological and phenotypic traits, identifying and defining these lineages may improve our understanding of the ETEC diversity and may lead to the identification of lineage-specific protective antigens that can be used in vaccines. To identify these lineages, we performed multilocus sequence typing (MLST) and phylogenetic analyses on a collection of ETEC strains that had been isolated from humans in different countries. We also estimated each lineage's age as a measure of how stable and well established these lineages are in the E. coli population. If the ancestral origin of the human ETEC population changes frequently, it would complicate efforts to identify new, chromosomally encoded antigens capable of inducing protective immune responses against ETEC. In that case, today's main vaccine development strategy of targeting plasmid-encoded virulence factors, such as the toxins and CFs, would probably continue to be the best approach for developing effective ETEC vaccines.
MATERIALS AND METHODS
Bacterial isolates. We define human ETEC to be any E. coli strain isolated from humans that encodes one or more of the three enterotoxins STp, STh, and LT. We obtained and successfully performed MLST on 1,019 human ETEC strains from different parts of the world, except for two strains for which we used publicly available genomic sequences (3, 5, 8, 10, 12, 13, 18, 21, 23, 24, 26-28, 40-43, 46, 48, 51-53, 55, 58) (Table 1). The strains originated from ≥13 different countries and are listed individually in Table S1 in the supplemental material. The 953 strains from Guinea-Bissau, Saudi Arabia/Egypt, India, and Vietnam were obtained during several epidemiological studies of diarrhea, while the remaining 66 strains were mainly isolated from individual cases of diarrhea or were chosen as representatives from larger strain collections. All strains had been isolated within the last 4 decades, and most were isolated between 1980 and 2000.
Human ETEC strains accepted into the study
The phylogenetic analyses were performed against a backdrop of MLST data from 1,250 non-ETEC E. coli and 8 ETEC isolates from pigs that were available from The Thomas S. Whittam Microbial Evolution Laboratory's strain collection at Michigan State University. The non-ETEC E. coli strains included 329 enterohemorrhagic, 245 Shiga toxin-producing, 213 enteropathogenic, 30 uropathogenic, 19 enteroaggregative, and 16 enteroinvasive E. coli strains; 124 Shigella strains; and 114 E. coli strains isolated from the environment and 41 from animals; as well as 119 epidemiologically poorly defined E. coli strains from humans. The eight porcine ETEC strains had been isolated from cases of diarrhea (three strains), edema (four strains), and septicemia (one strain).
Toxin and CF analyses. All human ETEC strains were screened for the presence of the enterotoxin structural genes by using a multiplex PCR assay (44), followed by testing for the presence of the STp, STh, and LT structural genes, as well as the structural genes of 18 CFs, namely, colonization factor antigen I (CFA/I), coli surface antigen 1 (CS1), CS2, CS3, CS4, CS5, CS6, CS7, CS8, CS12, CS13, CS14, CS15, CS17, CS18, CS19, CS21, and CS22, by DNA-DNA colony hybridization as described elsewhere (45). Positive hybridization assay results were confirmed by repeating the hybridization assay. We defined strains that were negative for the 18 CFs for which we had an assay (45) as CF negative, although it is possible that they produce other known or hitherto-undescribed CFs.
MLST. We used the seven-gene (st7) MLST system of the EcMLST setup (www.shigatox.net/mlst), which is based on sequencing PCR-amplified internal fragments of the aspC, clpX, fadD, icdA, lysP, mdh, and uidA housekeeping genes. Each fragment was sequenced in both directions and without active knowledge of each strain's country of origin, toxin profile, or CF profile. We designed and used in-house computer programs to process the sequences. Discrepancies between forward and reverse sequencing reactions, potential double peaks, and low-quality sequences were displayed on-screen for user confirmation. We performed a final sequence quality control by aligning the consensus sequences from all strains and verifying the base calls of singletons, unique gaps, and insertions, as well as the base-call differences between strains that appeared closely related in a maximum-likelihood tree.
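The in-house sequence-processing programs are not described in detail. As a minimal sketch of the kind of check involved, the following Python example flags positions where a forward read and the reverse-complemented reverse read disagree. The function names, the toy reads, and the assumption that both reads are already trimmed to the same fragment and have equal length are hypothetical and are not taken from the study.

```python
# Hypothetical sketch: compare a forward read with the reverse-complemented
# reverse read and report positions that would need user confirmation.
# Assumes both reads cover exactly the same fragment (no alignment step).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_discrepancies(forward, reverse):
    """Return (position, forward_base, reverse_base) for each disagreement
    or ambiguous (N) base call between the two reads."""
    rc = reverse_complement(reverse)
    if len(forward) != len(rc):
        raise ValueError("reads must cover the same fragment")
    flagged = []
    for i, (f, r) in enumerate(zip(forward.upper(), rc)):
        if f != r or f == "N":
            flagged.append((i, f, r))
    return flagged


forward_read = "ATGGCTAACG"
reverse_read = "CGTTAGGCAT"  # opposite strand, read 5' to 3'
print(find_discrepancies(forward_read, reverse_read))  # [(3, 'G', 'C')]
```

A real pipeline would also need to use base-call quality scores and detect double peaks in the chromatograms, which this sketch does not attempt.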
The allele number for each gene fragment and MLST sequence type (ST) for each allele number combination was obtained through searches on the EcMLST system website. New alleles and allele profiles were assigned allele numbers and STs, respectively, by the EcMLST curator. The allele numbers and corresponding sequences for each identified human ETEC ST are found in Table S2 in the supplemental material.
Maximum-likelihood tree generation. To describe how the ETEC lineages are distributed in the E. coli population structure, we generated a phylogenetic tree based on the MLST sequence data from both ETEC and non-ETEC E. coli, where the positions of different ETEC lineages were visualized. We used a maximum-likelihood-based method to generate the tree because these methods probably produce the most accurate phylogeny.
It should be kept in mind that these analyses were performed on a relatively small amount of each strain's total genomic content, which limits the accuracy of these analyses. Recombinational events that are not accounted for may distort the actual relationship among the strains. The phylogenetic tree, which is presented in Fig. 1, is mainly meant to provide an overview of how the ETEC lineages may be distributed through the E. coli population structure.
Distribution of human ETEC clonal groups in the E. coli population structure. The maximum-likelihood tree is based on 394 different MLST STs from 1,019 human ETEC, 8 porcine ETEC, and 1,250 non-ETEC E. coli isolates. Each bubble represents a unique ST, centered on its tree node. The area of each bubble is proportional to the number of human ETEC strains that had the given ST. Bubbles representing the same clonal group (CG) have the same color and, unless they overlap with each other, are connected by dotted lines. The long connector lines seen in CG4 and CG11 are probably effects of recombination. The numbers 1 to 42 indicate each of the 42 ETEC CGs. Known non-ETEC E. coli CGs indicated in the drawing are EHEC 1 (enterohemorrhagic E. coli O157:H7, e.g., strains EDL933 and Sakai), EHEC 2 (e.g., strains 11128 and 11368), EPEC 1 (typical enteropathogenic E. coli, e.g., strain E2348/69), EPEC 2 (e.g., strain B171), Shigella 1 (Shigella group 1, e.g., strains Sb227 and CDC3083-94), Shigella 2, Shigella 3 (e.g., strains Sf2457T, 301, and Sf8401), S. dysenteriae 1 (Shigella dysenteriae type 1, e.g., strain Sd197), S. sonnei (e.g., strain Ss046), and UTI 1 (uropathogenic E. coli, e.g., strain CFT073).
One representative DNA sequence of each ST from the ETEC and non-ETEC E. coli strains was included in the analyses. We used the maximum-likelihood-based ProposeModel command in the Treefinder software (G. Jobb [www.treefinder.de ]) to identify a suitable nucleotide substitution model for the data. Opting for a single model for the seven MLST genes combined, the best-suited model was a maximum-likelihood-based 10-parameter general-time-reversible (GTR) substitution model that allowed for gamma-distributed substitution rate heterogeneity. We performed 500 bootstraps where resampling was done across all genes and where the substitution rate heterogeneity and the GTR model parameters were optimized for each bootstrap sample. The tree in Fig. 1 represents the most commonly observed topology, where branch lengths have been averaged across all bootstrap trees. Other substitution models and estimating methods were also used to check that this estimated tree gave a reasonable representation of the data. These included ML-based methods in Treefinder where the substitution model was optimized for each of the seven MLST genes, as well as the less computationally intensive neighbor-joining-based maximum composite likelihood substitution methods provided in MEGA4 (49). We used MEGA4 for drawing trees and CorelDraw 11 (Corel Corp., Ottawa, Ontario, Canada) for annotating the figures.
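The resampling step of such a bootstrap can be illustrated with a short Python sketch. It draws alignment columns with replacement from a concatenated alignment, which is the generic nonparametric bootstrap and not necessarily how Treefinder implements it internally. The function name, the toy alignment, and the fixed random seed are hypothetical, and the tree search and model optimization performed on each replicate are not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap).

    `alignment` maps a sequence type label to its concatenated, aligned
    sequence; every sequence must have the same length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # The same column indices are applied to every sequence, so each
    # resampled site keeps its original pattern across sequence types.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Toy data (hypothetical labels and sequences).
toy = {"ST_a": "ACGTACGT", "ST_b": "ACGTACGA", "ST_c": "ACCTACGA"}
rng = random.Random(1)
replicates = [bootstrap_alignment(toy, rng) for _ in range(500)]
print(len(replicates), len(replicates[0]["ST_a"]))
```

Each replicate would then be passed to the tree-building program, and the support for a cluster would be the fraction of replicate trees in which that cluster appears.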
Lineage identification. We define an ETEC clonal group to be an assemblage of ETEC strains that originated from the same ETEC lineage (i.e., as having a common ETEC ancestral origin) or to be represented by a single strain if the strain did not seem to share an ancestral origin with other ETEC included in the study. Two ETEC strains were considered to belong to the same clonal group if at least six of their seven sequenced MLST genes were identical to each other or to the genes of another ETEC strain belonging to the group, which was assessed by using the eBURST software, version 3 (14, 50). Two ETEC strains were also considered to belong to the same clonal group if they clustered together in the maximum-likelihood tree with a bootstrap support of ≥80%, and if the majority of the other strains that clustered with them were also ETEC.
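The allele-sharing rule can be sketched in Python as single-linkage grouping of allele profiles. This is a simplified illustration of the six-of-seven criterion only, not a reimplementation of eBURST, and it ignores the bootstrap-based criterion. The function names, ST labels, and allele numbers are hypothetical.

```python
from itertools import combinations


def shared_alleles(p, q):
    """Count the loci at which two allele profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q))


def group_profiles(profiles, min_shared=6):
    """Group sequence types so that each member shares at least
    `min_shared` alleles with at least one other member of its group."""
    parent = {st: st for st in profiles}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if shared_alleles(profiles[a], profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), []).append(st)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical seven-locus allele profiles.
profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),  # differs from ST_A at one locus
    "ST_C": (1, 1, 1, 1, 1, 3, 2),  # differs from ST_B at one locus
    "ST_D": (5, 6, 7, 8, 9, 4, 3),  # unrelated singleton
}
print(group_profiles(profiles))
# [['ST_A', 'ST_B', 'ST_C'], ['ST_D']]
```

Note that ST_A and ST_C share only five alleles but still fall in the same group because both are linked through ST_B, which mirrors the "or to the genes of another ETEC strain belonging to the group" clause.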
The clonal groups were numerically named in descending order of the number of clonal group strains that did not originate from the Guinea-Bissau study and, subsequently, in descending order of total number of strains in the clonal group. The ranking weight of the Guinea-Bissau strain collection was thus reduced because the estimated pathogenicity of those strains varied (46).
Lineage age calculation. Estimating the age of each lineage enables us to assess how frequently we can expect new, major ETEC lineages to emerge from the E. coli population. For these analyses, we assume that mutations in the MLST genes that do not lead to amino acid changes (silent mutations, or synonymous substitutions) accumulate randomly across all seven MLST genes and at a constant rate (the synonymous substitution clock rate). By dividing the number of synonymous substitutions per site by an accurate synonymous substitution clock rate, we can obtain a good estimate of the time since these strains last shared a common ancestor. This value is our lineage age estimate. We can estimate the number of synonymous substitutions that these strains have accumulated since they last shared a common ancestor with a fair amount of accuracy, but to estimate the synonymous substitution clock rate for ETEC we would need the gene sequences of two ETEC strains that shared a common ancestor a known number of years ago, preferably hundreds or thousands of years ago. Instead, we rely on a recent estimate that is based on analyses of two Vibrio cholerae strains that shared a common ancestor approximately 130 years ago (15).
For the lineage age analyses, we included one representative of each ST that contributed a unique combination of synonymous site nucleotide differences. We first identified and removed synonymous substitutions that had most likely been acquired through recombination. Because the rate of recombination is higher than that of mutations (19), failing to account for such events would lead to an over- or underestimation of lineage age. Assuming that synonymous substitutions from mutations would accumulate randomly across all MLST genes, we generated a Poisson probability distribution around the mean number of synonymous substitutions for the seven MLST gene fragments in a given ST (Poisson function, Excel 2003 [Microsoft Corp., Redmond, WA]). We then tested the probability of observing the given number of synonymous substitutions for each gene fragment in the ST and considered a left- or right-tail probability of <0.001 to indicate that the observed number of synonymous substitutions in the given gene had been acquired through recombination. We also attempted to identify recombination events by using RDP3 Beta 35 (25) and GENECONV, version 1.81a (S. A. Sawyer, GENECONV [http://www.math.wustl.edu/~sawyer/geneconv/]), but these analyses did not yield any clear indication that any of the MLST genes had recombined.
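The Poisson screen can be sketched in Python as follows. The example computes the mean number of synonymous substitutions across the seven gene fragments of one ST, evaluates the left- and right-tail Poisson probabilities of each observed count, and flags genes whose tail probability falls below 0.001. The function names and the substitution counts are hypothetical. The original analysis used the Poisson function in Excel, so this is an equivalent calculation under stated assumptions, not the authors' worksheet.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def tail_probabilities(k, lam):
    """Return (P[X <= k], P[X >= k]) for X ~ Poisson(lam)."""
    left = sum(poisson_pmf(i, lam) for i in range(k + 1))
    right = 1.0 - sum(poisson_pmf(i, lam) for i in range(k))
    return left, right


def flag_recombinant_genes(counts, alpha=0.001):
    """Flag genes whose synonymous substitution count is improbable
    (either tail below alpha) given the mean across all gene fragments."""
    lam = sum(counts.values()) / len(counts)
    flagged = []
    for gene, k in counts.items():
        left, right = tail_probabilities(k, lam)
        if min(left, right) < alpha:
            flagged.append(gene)
    return flagged


# Hypothetical synonymous substitution counts for one ST.
counts = {"aspC": 1, "clpX": 14, "fadD": 0, "icdA": 1,
          "lysP": 0, "mdh": 1, "uidA": 0}
print(flag_recombinant_genes(counts))  # ['clpX']
```

In this toy case the mean is 17/7 substitutions per fragment, so observing 14 in a single fragment is far out in the right tail, whereas counts of 0 or 1 are unremarkable.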
We used DnaSP (version 4.50.3) (39) to calculate the Jukes-Cantor corrected average pairwise difference at synonymous sites (dS) (2), which is defined as the average number of nucleotide differences per site between gene sequences from any two randomly chosen STs. Exact Poisson 95% confidence limits for these pairwise differences were calculated by using the CIPOISS macro in the SAS System, version 9.1 (SAS Institute, Inc., Cary, NC). The age of each lineage was estimated by dividing dS and its confidence limits by a synonymous substitution clock rate. We used the estimate published by Feng et al. of 0.97 synonymous substitutions caused by mutations/year/Vibrio cholerae genome (15). The V. cholerae M66-2 genome (GenBank accession nos. CP001233 and CP001234) has 3,693 open reading frames, comprising 3,464,121 coding nucleotides, 817,220 of which are synonymous sites (calculated by using DnaSP). Their rate thus generalizes to a clock rate of 0.97/817,220 = 1.187 × 10^-6 synonymous substitutions/site/year.
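The arithmetic behind the clock rate and the age estimate can be reproduced with a short Python sketch. The two constants are the values quoted in the text. The Jukes-Cantor formula is the standard correction and is shown only to illustrate what the corrected difference represents, since DnaSP performed this step in the study. The function names and the example proportion of differing sites are hypothetical.

```python
import math

# Values quoted in the text.
SUBSTITUTIONS_PER_GENOME_PER_YEAR = 0.97  # estimate of Feng et al. (15)
SYNONYMOUS_SITES = 817_220                # V. cholerae M66-2 genome


def clock_rate():
    """Synonymous substitutions per site per year."""
    return SUBSTITUTIONS_PER_GENOME_PER_YEAR / SYNONYMOUS_SITES


def jukes_cantor(p):
    """Standard Jukes-Cantor corrected distance for an observed
    proportion p of differing sites (valid for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def lineage_age_years(d_s):
    """Age estimate as described in the text: dS divided by the clock rate."""
    return d_s / clock_rate()


rate = clock_rate()
print(f"{rate:.3e}")  # 1.187e-06

# Hypothetical example: 0.15% of synonymous sites differ on average.
example_ds = jukes_cantor(0.0015)
print(round(lineage_age_years(example_ds)))
```

The same division applied to the lower and upper Poisson confidence limits of dS would give the confidence limits of the age estimate.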
RESULTS
Human ETEC strains that originate from the same ETEC lineage may have inherited many of the same epidemiological and phenotypic traits. Identifying and describing these lineages could therefore provide an additional basis on which to understand ETEC epidemiology and for identifying new vaccine antigens. We used MLST and phylogenetic analyses of strains isolated from different parts of the world to identify these lineages.
Toxin and CF analyses. The 1,019 human ETEC strains represented six different enterotoxin profiles: STp (n = 49 strains), STh (n = 183), LT (n = 537), STpLT (n = 107), SThLT (n = 141), and STpSThLT (n = 2), and 654 (64%) of the strains were positive for one or more CFs. The identified CFs were CFA/I (n = 71), CS1 (n = 26), CS2 (n = 59), CS3 (n = 104), CS4 (n = 7), CS5 (n = 44), CS6 (n = 178), CS7 (n = 31), CS8 (n = 8), CS12 (n = 31), CS13 (n = 49), CS14 (n = 52), CS17 (n = 41), CS18 (n = 71), CS19 (n = 19), and CS21 (n = 139) and were found in 51 different toxin-CF combinations. No strains were positive for CS15 or CS22. We did not have tests for detecting CS10, CS11, and CS20 genes.
The toxin-CF profile distribution of strains from the Guinea-Bissau study differed somewhat from that of the strains from the other studies. The 745 Bissau strains represented fewer distinct toxin-CF profiles than the 274 other strains (30 versus 48), the proportion of CF-negative strains was higher (321 [43%] versus 44 [16%]), the proportion of LT-only strains was higher (453 [61%] versus 84 [31%]), and the proportion of STh-positive strains was lower (181 [24%] versus 145 [53%]). The toxin and CF profiles for each human ETEC strain are listed in Table S1 in the supplemental material.
MLST analyses. The 1,019 human ETEC strains represented 105 different STs. Fifteen of the strains, representing four different STs, lacked one of the seven MLST genes. A total of 63 (60%) of the 105 STs were represented by >1 strain. The median number of strains for each ST was 2 (minimum, 1; interquartile range, 1 to 11; 90th percentile, 1 to 29; maximum, 107). The ST for each human ETEC strain is listed in Table S1 in the supplemental material. A total of 71 (68%) of the 105 STs comprised strains that displayed a single toxin-CF profile. The median number of different toxin-CF profiles represented by each ST was 1 (minimum, 1; interquartile range, 1 to 2; 90th percentile, 1 to 3; maximum, 19). A total of 26 (25%) of the STs were represented by strains from >1 country, and 21 (81%) of these represented >1 different toxin-CF profiles.
Combined, the MLST sequences from the 1,019 human ETEC, 8 porcine ETEC, and 1,250 non-ETEC E. coli strains represented 394 different STs. The eight porcine ETEC strains represented five different STs, three of which were also represented by human ETEC strains (no. of strains): ST171 (n = 3), ST86 (n = 2), and ST212 (n = 1). The 1,250 non-ETEC E. coli strains represented 309 different STs, 22 of which were also represented by human ETEC strains (no. of strains): ST171 (n = 30), ST140 (n = 15), ST89 (n = 13), ST223 (n = 10), ST230 (n = 8), ST34 (n = 6), ST129 (n = 6), ST148 (n = 6), ST134 (n = 4), ST117 (n = 3), ST86 (n = 3), ST274 (n = 3), ST627 (n = 3), ST88 (n = 2), ST92 (n = 2), ST168 (n = 2), ST127 (n = 2), ST656 (n = 1), ST165 (n = 1), ST388 (n = 1), ST461 (n = 1), and ST574 (n = 1). Upon testing these 123 presumed non-ETEC E. coli strains for the presence of the ETEC toxin genes, two turned out to be ETEC, both of which were LT positive, ST140, and originally reported as being enteropathogenic E. coli (EPEC). In all, 10 of the 15 ST140 non-ETEC E. coli strains were reported as being EPEC. The results from earlier testing for the EPEC virulence gene eae on the strains from Guinea-Bissau showed that, of 41 ST140 ETEC strains, all 14 LT-CF negative but none of the 27 STh-CFA/I strains were eae positive. The remaining non-ETEC E. coli strains that shared the same ST as the human ETEC appeared to represent a varied selection of different types of E. coli, with the notable exceptions that 12 of the 13 ST89 strains were Shiga toxin-producing E. coli and that 33 (29%) of the 114 environmental E. coli isolates were represented by 15 of these 22 STs.
Lineage identification and description. The 1,019 human ETEC strains could be grouped into 42 clonal groups (CGs), and the CGs appeared to be widespread throughout the E. coli population structure (Fig. 1). The eBURST analysis by itself yielded 19 eBURST groups and 27 singletons, while results from the bootstrap analyses contributed to consolidating ST86, ST741, and ST786 to CG2, ST769 to CG11, and ST770 to CG12 (Fig. 2 and Fig. 3). The CG designation for each human ETEC strain is listed in Table S1 in the supplemental material.
Composition of human ETEC clonal groups 1 to 8. Each clonal group (CG) comprises ≥10 strains that do not originate from Guinea-Bissau. The CGs were identified through a combination of eBURST analyses and maximum-likelihood bootstrap analyses. In each CG, strains with different MLST ST-toxin/colonization factor (CF) combinations are listed separately. Each bubble represents a unique ST-toxin/CF-origin combination, and the area of each bubble is proportional to the number of strains that have this ST-toxin/CF-origin combination. Different colored bubbles are used for depicting different CGs, and each CG color combination matches those used in Fig. 1. The topology is taken from the maximum-likelihood bootstrap consensus tree, where connected entries share a ≥50% bootstrap support (≥80% for entries flagged with an asterisk). The legend for the bubble sizes is found in Fig. 1. ETEC origin: 1, Guinea-Bissau; 2, Saudi Arabia and Egypt; 3, India; 4, all other countries.
Composition of human ETEC clonal groups 9 to 42. Each clonal group comprises <10 strains that do not originate from Guinea-Bissau. See the legend for Fig. 2 for further explanations.
Thirty-one (74%) of the 42 CGs were represented by >1 strain. The median number of strains for each CG was 9 (minimum, 1; interquartile range, 1 to 32; 90th percentile, 1 to 65; maximum, 141). Twenty (48%) of the CGs comprised strains from >1 country (Fig. 2 and 3). These CGs represented 855 (84%) of all analyzed strains and 267 (97%) of the strains that did not originate from Guinea-Bissau. Of the 22 CGs represented by strains from a single country (Fig. 3), 15 (68%) were formed by strains from Guinea-Bissau.
The STh, STp, and LT genes were found in 14, 22, and 36 of the 42 CGs, respectively. Of the 16 different types of CFs that were represented in this strain material, all except CS2 and CS8, which were only present in CG1, were found in ≥2 CGs. Each identified CF was represented in a median of four CGs (minimum, 1; interquartile range, 2 to 6.25; 90th percentile, 1.5 to 8; maximum, 15). Of the 326 STh-positive strains included in the study, 313 (96%) were found in CG1 to CG8 (Fig. 2). Nineteen (45%) of the CGs represented strains that had ≥2 toxin profiles, and 25 (60%) of the CGs represented strains that had ≥2 toxin-CF profiles. The median number of different toxin-CF profiles for each CG was 2 (minimum, 1; interquartile range, 1 to 4.75; 90th percentile, 1 to 6; maximum, 21).
Nineteen (45%) of the 42 CGs comprised strains representing ≥2 different STs. The median number of different STs for each CG was 1 (minimum, 1; interquartile range, 1 to 3; 90th percentile, 1 to 6; maximum, 11).
Six of the eight porcine ETEC could be grouped together with human ETEC strains: three O147 F18 (or F107) STa STb-positive strains of ST171 belonged to CG1, one O157:H43 and one O8:K87 88ab:H19 isolate of ST86 belonged to CG2, and one O9:K103:NM F6 (or 987P) STa STb-positive strain of ST212 belonged to CG22. The remaining two isolates, which included one O138 isolate (ST683) and one O147 F18 STa STb-positive isolate (ST-15), could be grouped together into a separate CG not named here.
Lineage age. Of the 3,753 nucleotides present in the MLST gene sequence alignment, ∼920 were synonymous sites, in the sense that any nucleotide changes to these sites would not result in amino acid changes in the gene product. We identified seven probable gene recombination events, namely, clpX in ST277, ST704, and ST727 (CG2); fadD in ST703 (CG4); aspC in ST726 (CG5); clpX in ST274 (CG6); fadD in ST223, ST230, ST719, ST721, and ST749 (CG11); icdA in ST223 (CG11); and mdh in ST754 (CG28), which led to the exclusion of 14, 8, 5, 5, 9, 13, and 5 synonymous substitutions, respectively, from the age calculations. We rechecked the original sequence chromatograms to make sure that the remaining synonymous substitutions were correctly called.
The estimated time since the emergence of each lineage ranged from the time the first strain in a CG was isolated (shown as zero years ago) to over 30,000 years ago (Table 2). Twenty-seven of the lineages represented strains that had no variation in the synonymous sites and therefore had an estimated age of zero years. Of the 15 remaining CGs, which represented strains with one or more synonymous substitutions, the median estimated age was 1,222 (interquartile range, 918 to 1,588) years (Table 2). These lineages comprised 792 (78%) of the strains included in the study, and 237 (87%) of the strains that did not stem from the Guinea-Bissau study.
Human ETEC lineage age estimates
DISCUSSION
This is the first attempt to identify, characterize, and rank the human ETEC lineages that exist in the E. coli population. The results from the initial analyses of the 42 lineages we identified offer several new insights about the human ETEC population, including that most ETEC strains that have infected humans in the last few decades have probably originated from well-established and globally widespread ETEC lineages. There appears to be extensive movement of the plasmid-encoded toxin and CF genes between these ETEC lineages, and porcine and human ETEC strains often seem to have a shared ETEC ancestry.
Results from previous fingerprinting studies have already shown that ETEC strains that have the same serotype, toxin, or CFs are often closely related (4, 11, 31-33, 43, 47). The fingerprinting methods used in those studies are designed to highlight rapidly occurring genomic changes that tend to be poorly understood but which are useful targets for identifying strains that share a recent common ancestor. Most (95%) CF-positive ETEC strains from the Guinea-Bissau study that clustered together by fingerprinting (47) also grouped together by MLST, suggesting excellent congruence between the typing methods. However, several clusters that appeared discrete by fingerprinting were found to group together by MLST, which is consistent with the greater discriminatory power of PCR-based fingerprinting over sequence-based methods. The three other studies that have used phylogenetic methods that are best suited for assessing long-term ancestral relatedness of bacteria, including MLST and multilocus enzyme electrophoresis, have all shown that ETEC have emerged from E. coli on several occasions (37, 38, 51). Apart from this finding, too few strains were probably included in these studies to reveal a clear picture of the ETEC population structure.
Because E. coli has a low rate of chromosomal gene recombination (30, 35), the eBURST analyses we used should be reliable at identifying ETEC lineages (50). The maximum-likelihood bootstrap analyses were added to help identify closely related ETEC strains that could not be grouped by the eBURST analyses alone. The bootstrap method is probably more error-prone than the eBURST-based method because intermediately related non-ETEC E. coli strains need to be included in the analyses to avoid grouping ETEC strains that have originated from separate lineages. Only strains representing five STs were added to CGs based on bootstrap support alone, so this is probably not a large problem in the present study.
There is currently a debate about which models and synonymous substitution rates should be used to estimate the age of bacterial lineages (1, 22). We used the most recent and conservative synonymous substitution rate for V. cholerae, which should be suitable for estimating the age of relatively young lineages (15). We do not know, however, whether the V. cholerae substitution rate is comparable to that of ETEC strains or to that of E. coli strains in general. Had we used the rates suggested by Guttman and Dykhuizen (19) and by Whittam (56), which may be more suitable for estimating ages of older lineages (15, 22), our age estimates would be 40 and 198 times higher, respectively.
In addition to using the correct model and accounting for recombinations, the accuracy of the age estimate is dependent on the genetic diversity of the lineage being represented among the strains that are being analyzed. Many of the apparently new ETEC lineages observed here could, with even better strain representation, turn out to be older lineages. Failing to group strains from the same lineage or grouping strains from different lineages would also lead to inaccurate age estimates. CG12, for example, appears to be over 10 times older than the second oldest lineage. This lineage could represent an archetype ETEC or wrongly grouped distinct lineages. As more ETEC and non-ETEC E. coli strains are added to the analyses, the accuracy of the age estimates, as well as the correctness of the lineage compositions, will improve.
Some of the identified ETEC STs were shared with non-ETEC E. coli strains. A likely explanation for this overlap is that the seven-gene MLST scheme does not always provide sufficient resolution to separate strains from different E. coli lineages or between ETEC and the ancestral conditions that preceded acquisition of the ETEC plasmids. It is also possible that some of these strains are ETEC strains that have lost their ETEC virulence plasmids or that they represent parts of ETEC lineages that have adapted to survive without the ETEC virulence plasmids. The seven-gene MLST scheme we used may not always provide an adequate phylogenetic signal to enable accurate estimation of strain relatedness. This is particularly a problem when comparing strains that are distantly related. The accuracy of phylogenetic trees like that shown in Fig. 1 would therefore improve by including additional housekeeping gene sequences in the analyses.
Our method for naming and ranking CGs was used to better reflect each lineage's relative contribution to human disease, where the groups that comprise the most strains that had been isolated mainly from diarrheal cases were considered most important. We thereby reduced the focus on strains from the Guinea-Bissau study because these included strains of various degrees of pathogenicity (46). The rank position of the top 18 clonal groups (CG1 to CG18) would have remained the same even if we had excluded the Guinea-Bissau strains from the ranking. We do not expect the ranking to be accurate for the higher-numbered CGs, because these CGs were identified based on a small number of strains and on strains from the Bissau study alone (CG28 to CG42). The finding that close to all STh-positive strains, which arguably represent some of the most pathogenic types of ETEC (36, 46), including in the Guinea-Bissau study (46), were limited to the top eight CGs suggests that the ranking does reflect the disease burden contribution of strains from the different lineages to a certain degree. The composition and the ranking of the lineages will probably change somewhat, and new lineages will be identified as data from more strains from other parts of the world are included in the analyses.
There was a considerable overlap between the CGs represented by the Guinea-Bissau strains and the other strains included in the study, which suggests that the strains from Guinea-Bissau do not originate from a unique population of ETEC lineages. This finding offers hope that the ancestral origins of human ETEC that circulate in areas of endemicity may not differ much between different geographical regions or human populations and that lineage-specific ETEC vaccines that are effective across different regions and human populations may be developed.
The toxin and CF genes appeared to be spread across several different lineages, suggesting that they were acquired through horizontal gene transfer. Most likely, the genes were spread through horizontal transfer of ETEC virulence plasmids, on which the toxins and CFs, except CS2, are encoded (20, 34), because earlier studies have shown that some of these plasmids are easily transferred between different E. coli (60). This notion is supported by the finding that the supposedly chromosomally encoded CS2 gene was, together with CS8, the only CF represented in only one lineage. Little is known about the ETEC plasmid population, however, except that the plasmids may contain a large number of transposable elements, which cause the plasmids to recombine and evolve rapidly (17), and that chromosomally encoded genes may be needed to properly regulate plasmid expression (16). Further studies are needed to investigate the nature and rates of toxin and CF gene exchange between the lineages identified here.
The finding that strains from the same lineage may have different host specificities (humans and pigs) supports the results of Turner et al. (51), who reported two STs representing ETEC strains isolated from both humans and cattle. Together with the fact that three of the four lineages in the present study that represented porcine ETEC strains were shared by human ETEC strains, this suggests that changes in host specificity often occur after the lineage has been established.
In conclusion, we identified 42 different human ETEC lineages and found that most ETEC infecting humans today have probably originated from well-established and globally widespread ETEC lineages. Further analyses are needed to estimate the rate at which new lineages are being established in the E. coli population, to describe the nature and dynamics of the toxin and CF gene exchange, and to describe the within-lineage evolution of ETEC, including identifying events that contribute to increased pathogenicity, change in host specificity, and change in antigenic properties of the strains. Because strains that share an ancestry may also have inherited many of the same biological properties, including antigens capable of inducing protection, the population structure map presented here may provide a useful basis for interpreting the epidemiological diversity of ETEC infections and for developing lineage-specific vaccines.
ACKNOWLEDGMENTS
This study was supported in part by the Food and Waterborne Integrated Research Network Microbiology Research Unit under National Institutes of Health (www.nih.gov) research contract N01-AI-30058 (to T. S. Whittam); the Global Health and Vaccination (GLOBVAC) Research Programme under Research Council of Norway (www.rcn.no) contract 185872/S50 (salary to H. Steinsland); the Faculty of Medicine and Dentistry (www.uib.no/mofa/), University of Bergen, Bergen, Norway (salary to H. Steinsland); and Kaia and Arne Nævdals Fund, Bergen, Norway, research grant 470940 (to H. Steinsland).
We thank the Research Technology Support Facility at Michigan State University, in particular Shari Tjugum-Holland, for prompt and excellent handling and processing of sequencing reactions. We also thank Jan Schouten at MRC-Holland for his support and the contribution of reagents to the study. It would not have been possible to undertake this study without the kind contribution of strains. We also thank the Guinea-Bissau childhood diarrhea research team; Marcia Wolf at the Walter Reed Army Institute of Research; and Maharaj K. Bhan, Stephen J. Savarino, Ann-Mari Svennerholm, Trung Vu Nguyen, Andrej Weintraub, Moira M. McConnell, Erik A. Elsinghorst, Ana C. P. Vicente, Frederick J. Cassels, Ian Hendersson, Gloria Viboud, and Arlette Darfeuille-Michaud; as well as the Centers for Disease Control and Prevention and Michigan Department of Community Health for contributing ETEC strains.
FOOTNOTES
- Received 14 December 2009.
- Returned for modification 18 January 2010.
- Accepted 29 May 2010.
- Copyright © 2010 American Society for Microbiology