Journal of Clinical Microbiology, December 2004, p. 5502-5511, Vol. 42, No. 12
0095-1137/04/$08.00+0 DOI: 10.1128/JCM.42.12.5502-5511.2004
Copyright © 2004, American Society for Microbiology. All Rights Reserved.
Department of Medicine and Epidemiology,1 Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, California,3 Department of Veterinary and Biomedical Sciences, College of Veterinary Medicine, University of Minnesota, St. Paul, Minnesota2
Received 24 December 2003/ Returned for modification 10 February 2004/ Accepted 17 August 2004
The digestion of DNA by restriction endonucleases (REs) is one of the most commonly used DNA fingerprinting techniques, and specifically, pulsed-field gel electrophoresis (PFGE) is the primary method that uses REs with bacteria (3, 4, 28). PFGE involves the digestion of chromosomal DNA by specific REs to create large restriction fragments, typically in the range of 10 to 800 kb (4, 28). Electrophoresis of these fragments allows the visualization of a restriction fragment pattern (RFP) that comprises a series of bands, with each band representing DNA fragments of a particular size. The relationship between bacterial isolates is inferred from the similarities of their RFPs.
RFPs are primarily evaluated by two methods. The first assesses the relatedness of bacterial strains by determining the number of band differences between each pair of isolates (27, 28). The guidelines for this analysis are intended only to assess epidemiologically related strains, as would occur during an outbreak investigation. The interpretation of the number of band differences between a pair of isolates is based on the minimum number of genetic mutational events that would result in the observed number of band differences. For example, two isolates that differ by two to three bands would be considered closely related because a single genetic event can explain this difference.
The second analytic method is calculation of the band-sharing similarity coefficients, which represent continuous rather than categorical measures of relatedness. Briefly, each organism within the population of isolates being studied generates an RFP. The RFP for each isolate is then compared in a pairwise fashion to that for another isolate, and the number of bands shared by each pair of isolates is calculated. The number of bands in each RFP and the number of shared bands are then used to calculate the band-sharing coefficient. Ultimately, a matrix of band-sharing coefficients between all pairwise comparisons of isolates is used in a cluster analysis, and a rooted dendrogram that graphically depicts the relatedness of organisms can be produced.
The uses of PFGE in DNA fingerprinting are much broader than the simple assessment of the relationships of outbreak strains. PFGE is widely used to compare bacterial isolates collected over variable spatial and temporal scales. For example, the National Molecular Subtyping Network for Foodborne Disease Surveillance (PulseNet), sponsored by the Centers for Disease Control and Prevention, analyzes bacterial isolates from many laboratories in the United States as well as Canada (26). The objective is to rapidly assess the DNA fingerprints of isolates from disease outbreaks and follow-up isolates, even if the cases are geographically and temporally unrelated. Given the importance of the accurate assessment of the relationships of these isolates, particularly when distance and time separate the isolate sources, it is critical to have a thorough understanding of the potential biases inherent in PFGE data collection and analysis.
In practice, the use of PFGE as a DNA fingerprinting technique requires many subjective decisions to be made. This subjectivity increases the variability of the results among studies and, consequently, affects how those results are interpreted. Some of these decisions include selection of the specific RE and the number of different REs to be used, determination of the numbers and positions of the bands on the gel, determination of which bands are different or identical between different isolates, and the analytical techniques selected to assess the relatedness of isolates.
A number of methods for the analysis of PFGE data are available. For example, the software package BioNumerics (Applied Maths, Inc., Austin, Tex.) contains different algorithms for the importation of a gel image and normalization of the lanes in the gel and can be used to assess the similarity among the isolates and to construct dendrograms. While this affords the investigator flexibility in analyzing the data, it also engenders confusion about the use of different analytical techniques and their relationship to one another. This leads to uncertainty about the utility of one or more enzymes, which similarity (or dissimilarity) coefficients should be used, how misclassification errors should be accounted for during the process of band matching, and how inferences about the relatedness of isolates should be made by use of the analytical techniques chosen. The objectives of this study were (i) to compare the results of two analytical techniques commonly used with PFGE with populations of isolates for which the entire genetic sequence is known and (ii) to assess the improvement in interpretation when different numbers and combinations of enzymes are used for each isolate.
The first population (the outbreak population) was used for simulation of an epidemiologic trace-back investigation in which the reference E. coli sequence was the outbreak strain (27, 28). This population consisted of the reference isolate plus an additional 16 isolates. The 16 additional isolates were independently generated from the initial reference isolate as described above. In this way, each isolate was unique and had a predetermined expected similarity to the reference isolate. We created four sets of four isolates in which each isolate differed on average from the reference isolate by 0.05, 0.1, 0.25, and 0.5%, respectively. The relationship between these isolates is shown in Fig. 1A.
FIG. 1. Relationships of the isolates in the two populations. The outbreak population (A) consisted of the reference isolate plus an additional 16 isolates (17 isolates in total), each of which differed from the reference isolate by a certain amount. The percentage represents the probability of mutation of each base position and, therefore, is approximately equal to the overall sequence difference between the reference isolate and the simulated isolate. The ecological population (B) consisted of the reference isolate, 6 isolates that were simulated from the reference isolate, 2 isolates that were simulated for each of the isolates simulated in the first step, and then an additional 2 isolates that were simulated from the isolates at the second step (43 isolates in total). The complete branching structure is shown only for isolate C1 but was identical for all isolates, isolates A through F.
The number of base differences between each pair of isolates was calculated by using a program written in Visual Basic. This allowed the sequence similarity between each pair of isolates to be calculated. In this calculation, the dissimilarity between each isolate and the reference isolate would be expected to be very close to the predetermined probability of random mutation. However, the similarity between pairs of simulated isolates was less predictable. The sequence dissimilarity was calculated as the number of base differences between the pair of isolates divided by the total number of bases (which was fixed due to the absence of insertions and deletions). One minus the dissimilarity provided the sequence similarity, which served as the "gold standard" of the similarity between each pair of isolates and as the reference coefficient against which all other similarity coefficients were compared.
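The similarity calculation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' Visual Basic program; the function name `sequence_similarity` and the toy sequences are hypothetical, and the sketch assumes equal-length sequences, as in the model without insertions or deletions.

```python
def sequence_similarity(seq_a: str, seq_b: str) -> float:
    """Return 1 - (number of differing positions / total number of positions)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length (no indels in this model)")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count positions at which the two sequences carry different bases.
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    dissimilarity = differences / len(seq_a)
    return 1.0 - dissimilarity


# Toy example: one substitution in ten bases.
reference = "ACGTACGTAC"
isolate = "ACGTTCGTAC"
print(sequence_similarity(reference, isolate))
```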
The digestion of each isolate with three different REs was simulated by using the known properties of three enzymes: XbaI (T^CTAGA), NotI (GC^GGCCGC), and SfiI (GGCCNNNN^NGGCC), where ^ marks the cleavage position. These three REs were chosen because they are frequently used to digest E. coli for PFGE studies and because each results in a different number of expected bands per isolate (23, 24, 27, 28). XbaI recognizes a sequence of 6 bp (TCTAGA), while NotI (GCGGCCGC) and SfiI (GGCCNNNNNGGCC, where N is any nucleotide) each recognize 8 specified bp. All simulated digestions were made with a program written in Visual Basic. With this program, the restriction fragments (sizes and nucleic acid contents) for each isolate and each enzyme were determined. This information was saved in a database (Microsoft Access; Microsoft Corp.).
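A simulated digestion of this kind can be sketched as follows. This is not the authors' program: the names `ENZYMES`, `cut_positions`, and `digest` are hypothetical, the cleavage offsets follow the commonly published cut positions for these enzymes, and the `circular` option is an assumption made here because the E. coli chromosome is circular. Recognition sites that span the origin of the sequence string are not detected in this sketch.

```python
import re

# Recognition pattern and cut offset (bases from the start of the site to the cut).
ENZYMES = {
    "XbaI": ("TCTAGA", 1),             # T^CTAGA
    "NotI": ("GCGGCCGC", 2),           # GC^GGCCGC
    "SfiI": ("GGCC[ACGT]{5}GGCC", 8),  # GGCCNNNN^NGGCC
}


def cut_positions(genome: str, enzyme: str) -> list:
    """Return the 0-based index of the base that follows each cut."""
    pattern, offset = ENZYMES[enzyme]
    # A lookahead is used so that overlapping sites are all reported.
    return [m.start() + offset for m in re.finditer(f"(?={pattern})", genome)]


def digest(genome: str, enzyme: str, circular: bool = True) -> list:
    """Return the restriction fragments (as sequences) for one enzyme."""
    cuts = cut_positions(genome, enzyme)
    if not cuts:
        return [genome]  # no site: the molecule stays in one piece
    fragments = [genome[a:b] for a, b in zip(cuts, cuts[1:])]
    head, tail = genome[:cuts[0]], genome[cuts[-1]:]
    if circular:
        fragments.append(tail + head)  # the fragment that spans the origin
    else:
        fragments = [head] + fragments + [tail]
    return fragments


toy_genome = "AAATCTAGAGGGGCGGCCGCTTT"
print(digest(toy_genome, "XbaI", circular=False))
print([len(f) for f in digest(toy_genome, "NotI")])
```

Keeping the fragment sequences, not only their sizes, is what later allows matching on both size and genomic content.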
Defining the RFP for each isolate. The RFP of an isolate was defined by using four approaches. The first approach (the COMP approach) defined an RFP by using the complete set of restriction fragments generated in the digestion. In addition, the comparison of isolates in the data set used for the COMP approach (the COMP data set) required that matching fragments contain the same number of nucleotides (exact size) with perfect sequence identity (same region of the genome). The second approach (the REST approach) defined an RFP by restricting the fragments that were analyzed to those that were greater than 25 kb and less than 700 kb. In practice it is common to use a minimum-size cutoff (e.g., 25 kb) to eliminate the possibility of including plasmid DNA in the analyses (28). The 700-kb cutoff was applied because typical electrophoresis conditions do not allow the larger bands to migrate far enough into the gel to be resolved. The comparison of the REST data set also required a perfect size and sequence match between fragments of different isolates. These two methods were used to generate data sets for both the outbreak and the ecological populations of isolates by using the three REs separately and in combination.
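The REST size window and the perfect-matching rule can be illustrated with a short sketch. The function names are hypothetical, and representing each fragment by its full sequence is an assumption made here: two fragments then match only if they have identical size and identical content, which approximates the "same region of the genome" requirement under a point-mutation-only model.

```python
from collections import Counter


def rest_fragments(fragments, min_bp=25_000, max_bp=700_000):
    """Keep only fragments larger than min_bp and smaller than max_bp."""
    return [f for f in fragments if min_bp < len(f) < max_bp]


def shared_perfect(frags_a, frags_b):
    """Count fragments present in both isolates with identical sequence."""
    # Multiset intersection: each fragment can be matched at most once.
    common = Counter(frags_a) & Counter(frags_b)
    return sum(common.values())


fragments = ["A" * 30_000, "C" * 10, "G" * 800_000]
print([len(f) for f in rest_fragments(fragments)])
print(shared_perfect(["AC", "GT", "AC"], ["AC", "TT"]))
```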
The third and fourth approaches incorporated two sources of error inherent in PFGE analyses (12, 28, 30). The first error occurs because multiple restriction fragments of approximately the same size may exist for a single isolate but are counted as a single fragment. This superimposition of bands is an intraisolate or an intralane type of error. In this analysis, if two fragments of an isolate possessed a relative size difference of less than 5%, they were considered a single band. The second error occurs because bands of similar sizes among different isolates are counted as identical bands, regardless of their genomic contents. This is an interisolate or an interlane type of error. In this analysis, if two bands for different isolates possessed a relative size difference of less than 5%, they were considered matching bands.
By using the outbreak and the ecological populations of isolates, these errors formed the basis of the third and fourth approaches for defining an RFP. The third approach (the IMP-COMP approach) used all of the restriction fragments used in the COMP approach, while the fourth approach (the IMP-REST approach) used the restricted fragment sets used in the REST approach. It was expected that these two misclassification errors would result in an underestimation of the diversity of the isolate set and a misclassification of the relationships among the isolates.
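The two misclassification errors can be sketched with fragment sizes alone. The 5% threshold comes from the text; everything else is an assumption of this sketch: the relative size difference is taken as the absolute difference divided by the larger size, co-migrating fragments are compared against the largest fragment already assigned to a band, and matches between lanes are assigned greedily with each band used once.

```python
TOLERANCE = 0.05  # relative size difference below which bands are not resolved


def relative_difference(size_a, size_b):
    return abs(size_a - size_b) / max(size_a, size_b)


def merge_comigrating(sizes, tol=TOLERANCE):
    """Intralane error: fragments of nearly equal size are read as one band."""
    bands = []
    for size in sorted(sizes, reverse=True):
        if bands and relative_difference(bands[-1], size) < tol:
            continue  # superimposed on the previous (larger) band
        bands.append(size)
    return bands


def count_matching_bands(bands_a, bands_b, tol=TOLERANCE):
    """Interlane error: bands match on size alone, whatever their content."""
    remaining = sorted(bands_b, reverse=True)
    matches = 0
    for size in sorted(bands_a, reverse=True):
        for i, other in enumerate(remaining):
            if relative_difference(size, other) < tol:
                matches += 1
                del remaining[i]  # each band can be matched only once
                break
    return matches


print(merge_comigrating([100_000, 97_000, 93_000, 50_000]))
print(count_matching_bands([100_000, 50_000], [98_000, 60_000]))
```

Other definitions of the tolerance (for example, relative to the mean of the two sizes) would shift borderline cases slightly but would not change the direction of the bias.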
Assessments of similarity between isolates. Two analytic methods were used to assess the relationship between isolates within a data set. The first was a qualitative method based on the number of genetic mutational events required to produce specific differences in PFGE patterns (28). The method classifies groups of isolates into four categories: indistinguishable, closely related, possibly related, and different. Isolates with no band differences are classified as "indistinguishable." "Closely related" isolates exhibit two to three band differences. This category suggests either that a single genetic event occurred within a restriction site and resulted in either the loss or the gain of a restriction site (three band differences) or that an insertion or deletion of genetic material changed the size of the restriction fragment (two band differences). "Possibly related" isolates exhibit four to six band differences, a theoretical result of two genetic mutational events. "Different" isolates exhibit more than six band differences. Although this method was developed to compare isolates within an outbreak that spans a narrow temporal window, the method has been used inappropriately to compare populations of unrelated isolates (14, 19, 25, 31). We applied this method only to the REST and IMP-REST data sets for the outbreak and ecological populations. By this method, each isolate was compared to the reference strain, resulting in 16 and 42 comparisons in the outbreak and ecological populations, respectively.
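The four categories can be written as a small lookup. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: the number of band differences is taken as the count of bands present in one pattern but not the other, and a single band difference, which the text does not assign to a category, is placed in "closely related".

```python
def band_differences(n_a: int, n_b: int, n_shared: int) -> int:
    """Bands unique to isolate A plus bands unique to isolate B."""
    return (n_a - n_shared) + (n_b - n_shared)


def classify_band_differences(n_differences: int) -> str:
    """Map a band-difference count to one of the four categories."""
    if n_differences < 0:
        raise ValueError("number of band differences cannot be negative")
    if n_differences == 0:
        return "indistinguishable"
    if n_differences <= 3:  # assumption: 1 difference grouped with 2 to 3
        return "closely related"
    if n_differences <= 6:
        return "possibly related"
    return "different"


# Reference with 10 bands, isolate with 11 bands, 9 bands shared.
n = band_differences(10, 11, 9)
print(n, classify_band_differences(n))
```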
The second method used the entire restriction fragment information generated from each enzyme for each isolate to calculate similarity indices by using the Dice coefficient. The Dice coefficient ($S_D$) (7) is calculated as $S_D = 2n_{AB}/(n_A + n_B)$, where $n_{AB}$ is the number of bands common to isolates A and B, $n_A$ is the total number of bands for isolate A, and $n_B$ is the total number of bands for isolate B.
Dice coefficients were calculated for the pairwise comparisons of all isolates within a data set for each enzyme. Coefficients were then calculated for all combinations of the enzymes. For the multiple-enzyme coefficients, the total number of matching bands for each enzyme comprised the numerator, while the denominator consisted of the total number of bands for each isolate and each enzyme.
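A minimal sketch of the single-enzyme and multiple-enzyme Dice calculations follows. The function names and the band counts in the example are hypothetical; the pooled version sums shared bands in the numerator and all bands in the denominator across enzymes, as described above.

```python
def dice(n_shared: int, n_a: int, n_b: int) -> float:
    """Dice coefficient for one enzyme: 2 * shared / (bands in A + bands in B)."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("at least one isolate must have a band")
    return 2 * n_shared / total


def pooled_dice(per_enzyme: list) -> float:
    """Multiple-enzyme Dice; per_enzyme holds (n_shared, n_a, n_b) per enzyme."""
    shared = sum(s for s, _, _ in per_enzyme)
    total = sum(a + b for _, a, b in per_enzyme)
    if total == 0:
        raise ValueError("at least one isolate must have a band")
    return 2 * shared / total


print(round(dice(8, 10, 12), 3))
print(round(pooled_dice([(8, 10, 12), (3, 4, 4)]), 3))
```

Note that the pooled coefficient is not the mean of the single-enzyme coefficients: enzymes that produce more bands carry more weight.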
Assessment of analytical techniques. In order to determine the fidelity of the band-sharing coefficients to the gold standard of sequence similarity, the lower diagonal matrices of the pairwise band-sharing coefficients were compared to the lower diagonal matrices of pairwise sequence similarities. There were 136 and 903 pairwise comparisons for the outbreak and ecological populations, respectively. The correlations between the matrices were calculated by using Mantel's randomization test (17), with P values estimated by using 5,000 permutations.
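A Mantel-type permutation test can be sketched in plain Python as below. This is a generic illustration, not the software used in the study: it assumes symmetric square matrices, uses the Pearson correlation of the lower-triangle entries, applies a one-sided test, and estimates the P value as (hits + 1)/(permutations + 1). All names are hypothetical.

```python
import math
import random


def lower_triangle(matrix):
    """Entries below the diagonal; n isolates give n(n-1)/2 values."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(matrix_a, matrix_b, permutations=5000, seed=0):
    """Return (observed correlation, one-sided permutation P value)."""
    n = len(matrix_a)
    rng = random.Random(seed)
    x = lower_triangle(matrix_a)
    observed = pearson(x, lower_triangle(matrix_b))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the isolates of matrix_b (rows and columns together)
        permuted = [matrix_b[order[i]][order[j]] for i in range(n) for j in range(i)]
        if pearson(x, permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# 17 and 43 isolates give the 136 and 903 pairwise comparisons cited above.
print(17 * 16 // 2, 43 * 42 // 2)
```

Rows and columns are permuted together because the pairwise entries of a similarity matrix are not independent observations, which is the reason an ordinary significance test for a correlation would not be appropriate here.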
Dendrograms were constructed only as a visual aid to depict the relationship between isolates for each of the populations and analyses. First, a dendrogram was constructed for each population of isolates by using the entire sequence data for each isolate in the population. A second dendrogram was constructed by using the band-sharing coefficient data from all analyses. The unweighted pair group method with average linkages (UPGMA) was used with the program PHYLIP Neighbor (11). All dendrograms were then visualized with the software TREEVIEW (20).
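For readers unfamiliar with UPGMA, the clustering step can be sketched in plain Python. This is a stand-in for illustration, not PHYLIP Neighbor: the function name `upgma` is hypothetical, the input is assumed to be a symmetric dissimilarity matrix (for example, 1 minus the similarity coefficient), and ties are broken by dictionary order.

```python
def upgma(names, distance):
    """Cluster by average linkage and return the tree as a Newick string."""
    # cluster id -> (Newick label, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {(i, j): distance[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        label_a, size_a, height_a = clusters.pop(a)
        label_b, size_b, height_b = clusters.pop(b)
        height = d / 2
        label = (f"({label_a}:{height - height_a:.4f},"
                 f"{label_b}:{height - height_b:.4f})")
        # Distance from the new cluster to each other cluster is the
        # size-weighted mean of the distances from its two members.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = (label, size_a + size_b, height)
        next_id += 1
    (label, _, _), = clusters.values()
    return label + ";"


print(upgma(["A", "B", "C"], [[0, 0.2, 0.6], [0.2, 0, 0.6], [0.6, 0.6, 0]]))
```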
Relationships within simulated E. coli populations. For the simulated populations of E. coli isolates, dendrograms based on the sequence similarity matrices for each population were constructed (Fig. 2 and 3). The similarity matrices generated in these analyses were the gold standard to which subsequent band-sharing analyses were compared. The dendrograms were not used for analysis; they were used only to visually compare the inferred relationships of the isolates.
FIG. 2. Dendrogram depicting the relatedness of the outbreak population of isolates. Relationships are based on the sequence similarity of isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
FIG. 3. Dendrogram depicting the relatedness of the ecological population of isolates. Relationships are based on the sequence similarity of isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
TABLE 1. Number of pairwise comparisons that fell within each of the band difference categories as set by the PFGE guidelines
TABLE 2. Number of pairwise comparisons that fell within each of the band difference categories as set by the PFGE guidelines
The relationships inferred from the analyses with the imperfectly matched data sets were not as well correlated to the true relationships as the relationships inferred from the analyses with the perfectly matched data sets. Many of the indistinguishable and closely related isolates in these analyses were distantly related to the reference strain, especially the isolates in the ecological population. Many of the analyses added isolates to the indistinguishable and closely related categories that were more distantly related to the reference strain than the relationship inferred in the perfectly matched analyses (Tables 1 and 2). The indistinguishable category, however, was more consistent in the outbreak population than in the ecological population of isolates; only one additional isolate was in this category for two of the digestions for the outbreak population by the IMP-REST approach, whereas seven and eight additional isolates were included in this category for two of the enzyme digestions for the ecological population. In general, the proportion of isolates considered indistinguishable or closely related was inversely related to the number of fragments produced by the enzyme digestion.
Quantitative comparisons. Dendrograms based on the band-sharing similarity coefficients for both the outbreak and the ecological populations were created. A set of dendrograms obtained for different endonuclease digestions with the REST and IMP-REST data sets is shown (Fig. 4 to 7). In addition, dendrograms representing the results of the multiple-enzyme digestions are also shown (Fig. 4 to 7). The dendrograms from these data sets were compared to the dendrograms created from the sequence data in order to visualize the inferred relationships among the isolates in each analysis. The Mantel correlation coefficients for each band-sharing similarity coefficient matrix with the sequence similarity matrix are shown for the outbreak and the ecological populations (Fig. 8).
FIG. 4. Dendrograms depicting the relatedness of the outbreak population of isolates and perfect matching. The four different analyses included digestion of the REST data set with XbaI (A); digestion of the REST data set with NotI (B); digestion of the REST data set with SfiI (C); and digestion of the REST data set with XbaI, NotI, and SfiI (D). Relationships are based on the matrices of band-sharing similarity coefficients among isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
FIG. 7. Dendrograms depicting the relatedness of the ecological population of isolates and imperfect matching. The four different analyses included digestion of the IMP-REST data set with XbaI (A); digestion of the IMP-REST data set with NotI (B); digestion of the IMP-REST data set with SfiI (C); and digestion of the IMP-REST data set with XbaI, NotI, and SfiI (D). Relationships are based on the matrices of band-sharing similarity coefficients among isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
FIG. 8. Mantel's randomization test correlation coefficients for the complete and restricted analyses of each enzyme combination with the outbreak population (A) and the ecological population (B). All correlation coefficients were statistically significant (P < 0.0005).
FIG. 5. Dendrograms depicting the relatedness of the outbreak population of isolates and imperfect matching. The four different analyses included digestion of the IMP-REST data set with XbaI (A); digestion of the IMP-REST data set with NotI (B); digestion of the IMP-REST data set with SfiI (C); and digestion of the IMP-REST data set with XbaI, NotI, and SfiI (D). Relationships are based on the matrices of band-sharing similarity coefficients among isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
Even under the unrealistic constraints of perfect matching, PFGE assessments did not re-create the phylogenetic relationships of the simulated populations. This difference was even more pronounced in the IMP-REST data sets, which represented the more relevant and realistic type of PFGE data that are being generated and analyzed in the laboratory. Consequently, the results and interpretations of the REST and IMP-REST data sets were emphasized in this study.
The correlation of the true phylogeny and that predicted by PFGE depended on the choice of enzyme or enzymes and analytic method. Fidelity between the phylogeny predicted by PFGE and the true phylogeny improved with the use of multiple enzymes (Fig. 8). However, the use of multiple enzymes can be costly in terms of time and money. If a single enzyme were to be used with the 17 isolates in the outbreak population, the decision of which enzyme to use would have greatly affected the epidemiological inferences. When XbaI was used in the perfectly matched REST analysis (Table 1), only two of the isolates would have been considered indistinguishable, none of the isolates would have been considered closely related, and three would have been possibly related. These numbers changed dramatically when one of the enzymes that recognized a sequence of 8 bp was used. The same patterns were observed in the imperfectly matched analyses. If these isolates were collected as part of a trace back during a food-borne outbreak, the choice of enzyme would have directly influenced our assessment of which isolates were part of the outbreak, and thus, the choice of enzyme could have serious repercussions regarding the identification of the source of the pathogen.
In RFP analyses the subjective process of band determination is critical and cannot be ignored (12, 28, 30). Consequently, we incorporated various types of errors into our analyses. The divergence from the true phylogeny became more severe as errors of content and determination were simulated. For example, we ignored the genetic contents of the restriction fragments and accounted only for the sizes of the fragments. Many bands that were generated by the three enzymes used in this study were of the same relative size but were derived from different segments of the genome. These superimposed bands would be difficult to distinguish by standard PFGE protocols. This finding was documented in a PFGE analysis of E. coli O157 by Davis et al. (6). In addition, in many situations two isolates had bands of almost the same size but originated from completely different segments of the genome. This would result in the false assignment of a band match between the two isolates. Davis et al. (6) also documented this type of error in the previously mentioned study with E. coli O157. This error is similar to the user-specified tolerance factor that is incorporated into many of the DNA fingerprint analysis software packages (5, 10, 12). The tolerance factor, however, is based on a difference in linear position on the gel rather than on the strict difference in fragment size.
When imperfect matching error was incorporated into the analyses, all of the isolates appeared to be more similar to each other (Fig. 5 and 7) than they were in the dendrograms that used perfect matching (Fig. 4 and 6). If the number of isolates that have at least 80% similarity with the reference strain is tabulated for each enzyme analysis, this number is consistently higher in the imperfectly matched (IMP) data sets than in the perfectly matched data sets. The correlations between the similarity coefficients and the underlying sequence data were dramatically reduced when imperfect matching was used. Finally, the number of isolates that were indistinguishable in the qualitative analyses increased in the imperfectly matched analysis compared to the number in the perfectly matched analysis (Tables 1 and 2). This finding was most noticeable for the ecological population. Although the criteria of Tenover et al. (27, 28) were not intended to be used in diversity analyses (as illustrated by the ecological population), many researchers inappropriately continue to do so (14, 19, 25, 31).
FIG. 6. Dendrograms depicting the relatedness of the ecological population of isolates and perfect matching. The four different analyses included digestion of the REST data set with XbaI (A); digestion of the REST data set with NotI (B); digestion of the REST data set with SfiI (C); and digestion of the REST data set with XbaI, NotI, and SfiI (D). Relationships are based on the matrices of band-sharing similarity coefficients among isolates, and the dendrogram was generated by UPGMA clustering. REF, reference isolate.
We observed that the enzyme that produced the most fragments, XbaI, had the highest fidelity with the sequence data in all analyses. However, the more fragments that an enzyme produces, the more chances there are for misclassification errors (6). When the imperfectly matched analyses were performed, XbaI and SfiI, both of which generated many fragments, had the highest proportionate decreases in correlation between band-sharing similarity and sequence similarity. The use of a frequently cutting enzyme may make gels difficult to interpret accurately.
The isolates in this study were created by using random point mutations throughout the genome. As described above, the probabilities of mutation were chosen to yield a desired probability of obtaining a mutation within a restriction site. Because each restriction enzyme had approximately 200 bases in targeted restriction sites, a 0.05% probability of random mutation provided an approximately 10% probability of a point mutation within a restriction site. In addition, these random point mutations provided the possibility of creating new restriction sites within the genome. In the course of randomly assigning base mutations, we did not account for conservative versus variable regions in the genome. All bases had the same probability of mutating, and if a mutation occurred, all three remaining bases had an equal probability of replacing the original base. Again, this is not realistic, and in a recent study of E. coli sequence evolution over extended periods, point mutations were rare (16). If we were to use these calculations (9, 16), then we would require more than $10^6$ generations to achieve the 0.05% mutation probability. We did not incorporate insertions, deletions, or other mutational events into the model. In reality, insertions and deletions may be responsible for the majority of RFP differences among isolates. For example, in a study of E. coli O157 (15), the majority of differences in XbaI PFGE patterns were due to insertions and deletions. These insertions and deletions are likely to result in more frequent errors in band matching. Our use of point mutations might represent a conservative illustration of the difficulties in assessing genetic relationships through PFGE. While the assumptions of this model do not reflect the dynamics of E. coli genetic evolution, the model that we developed enabled us to map simulated genomic changes and allowed us to make inferences about the use of PFGE as a tool to assess isolate relationships.
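The mutation model and the restriction-site arithmetic can be illustrated with a short sketch. It is not the authors' program: the names `mutate` and `prob_site_hit`, the random seed, and the 10,000-base toy sequence are hypothetical. The second function computes the probability that at least one of about 200 restriction-site bases mutates, which for a per-base probability of 0.05% is close to the approximately 10% quoted above.

```python
import random

BASES = "ACGT"


def mutate(sequence: str, probability: float, rng: random.Random) -> str:
    """Mutate each base independently; a mutated base is replaced by one of
    the three other bases with equal probability."""
    out = []
    for base in sequence:
        if rng.random() < probability:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def prob_site_hit(bases_in_sites: int, probability: float) -> float:
    """Probability that at least one restriction-site base is mutated."""
    return 1.0 - (1.0 - probability) ** bases_in_sites


rng = random.Random(42)
reference = "".join(rng.choice(BASES) for _ in range(10_000))
isolate = mutate(reference, 0.0005, rng)
print(sum(a != b for a, b in zip(reference, isolate)), "bases differ")
print(round(prob_site_hit(200, 0.0005), 4))
```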
The key point of RFPs in general and PFGE specifically is that while the data are used to infer genetic relationships between isolates, they do not necessarily represent true genetic relationships (6). Differences in RFPs indicate that isolates are genetically different, but the true degree of the genetic distance separating these isolates cannot be determined from RFPs. Conversely, similarities in RFPs do not necessarily mean that isolates are genetically similar. As the number of REs included in PFGE increases, the correlation between RFP similarity and true genetic similarity is likely to increase (6). However, the conclusions drawn from any molecular study must be put in the context of the other information associated with the isolates. The strength of isolate identity is greatest when epidemiologic data support point source or common elements of dissemination. Because of the high degree of subjectivity involved with the interpretation of RFPs, the user must carefully and thoughtfully select the conditions and techniques for performing, analyzing, and using PFGE fingerprints.