Journal of Clinical Microbiology, November 2005, p. 5483-5490, Vol. 43, No. 11
0095-1137/05/$08.00+0 doi:10.1128/JCM.43.11.5483-5490.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Biomathematics Group,1 Laboratory of Molecular Genetics, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal,2 Laboratory of Microbiology, The Rockefeller University, New York, New York,3 Department of Biostatistics, Bioinformatics, and Epidemiology, Medical University of South Carolina, Charleston, South Carolina4
Received 5 May 2005/ Returned for modification 24 June 2005/ Accepted 9 August 2005
An enormous variety of band patterns have been found for each bacterial species, with the type and the subtype classification being achieved by the widely used criteria of counting the number of band differences between two lanes proposed by Tenover et al. (23): if two strains differ by up to six bands, counted in both lanes, they are considered the same type. However, these authors pointed out that this method of classification should be used in outbreak studies only and should be backed up with other relevant typing data, such as antibiotic resistance.
Nevertheless, in the majority of longitudinal studies, the use of this criterion (22) yields good discrimination results, particularly when a small number of strains with distinct patterns are being compared (5, 15, 19). This is usually confirmed by visually inspecting the cluster tree to find the cutoff linkage value that agglomerates the band patterns, in accordance with the criteria of Tenover et al. (23).
However, as the number of strains to be clustered increases, this procedure will eventually fail to work because the same difference will span different groupings. This observation is a reflection of the fact that type definitions are arbitrary, in the sense that they reflect the process of strain identification gradually filling a domain of possible band patterns. The loss of a clear distinction between groups produced by hierarchical clustering algorithms (22) will also cause the membership in existing clusters (types) to be rearranged at the previously used cutoff value when a new strain is added to the collection.
A possible solution to the classification instability would be to use a large collection of classified patterns and determine what similarity value produces the best classification results. New strains would be classified by calculating the band similarity to all the entries in the existing catalog and using the highest similarity value determined to recognize membership in the same type. Such a solution would also require the determination of which band similarity coefficient best reproduces the reference classification. Accordingly, in this paper we evaluate the commonly used similarity coefficients (the Dice, Jaccard, Jeffrey's X, Ochiai, Cosine, and Pearson's correlation coefficients) for use in the automatic classification of both type and subtype. The comparison is performed with reference to a collection of 1,798 isolates of Streptococcus pneumoniae visually classified, using the criteria of Tenover et al. (23), into 96 types and 396 subtypes. The goodness of classification of the different similarity coefficients is assessed by using receiver operating characteristic (ROC) curves (8) to determine the ability of the different similarity measures to discriminate the visually recognized groups. The method described in this paper highlights the critical value of large visually classified strain collections as the foundation for the computerized automation of their classification. However, once the most effective similarity measure is found, the prospect is raised that the classification itself may be worth redefinition to adjust it to the natural granularity of the microbial population.
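The catalog lookup proposed above can be sketched as a nearest-neighbor search. This is only an illustration under stated assumptions: band patterns are represented as sets of already-matched band positions (hypothetical integer bins), the Dice coefficient is used as the similarity measure, and the names `classify_against_catalog` and `threshold` are introduced here and do not come from the paper or from Bionumerics.

```python
from typing import List, Optional, Set, Tuple


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice similarity between two sets of matched band positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify_against_catalog(
    new_pattern: Set[int],
    catalog: List[Tuple[Set[int], str]],
    threshold: float,
) -> Tuple[Optional[str], float]:
    """Return (type, similarity) of the most similar catalog entry.

    The type is None when even the best match falls below the
    threshold, i.e. the pattern would be treated as a new type.
    """
    best_type: Optional[str] = None
    best_sim = -1.0
    for bands, type_name in catalog:
        sim = dice(new_pattern, bands)
        if sim > best_sim:
            best_type, best_sim = type_name, sim
    if best_sim >= threshold:
        return best_type, best_sim
    return None, best_sim


# Hypothetical catalog with two types, "A" and "B".
catalog = [({1, 2, 3, 4}, "A"), ({10, 11, 12}, "B")]
print(classify_against_catalog({1, 2, 3, 5}, catalog, threshold=0.7))
```

The threshold is the quantity that the ROC analysis in this paper is designed to select; here it is simply passed in as a parameter.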
Visual similarity group (VSG) assignments. PFGE patterns were assigned to types and subtypes by visual inspection of the macrorestriction profiles by using currently accepted criteria (23). Two strains are considered of the same subtype if their band patterns match exactly and are considered of the same type if they have up to six band differences, counted in both lanes. In the rare case that a strain had fewer than six differences from two types, the type assignment was done by comparing the strain to all the strains of the two types, and the strain was then assigned to the type with the smaller overall number of band differences. In these cases the type assignment was also supported by other epidemiological information, such as antibiotic resistance patterns and, more recently, multilocus sequence typing information.
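As a rough illustration of the counting rule described above, the sketch below counts band differences over both lanes and applies the six-band cutoff. It assumes the bands have already been matched and encoded as sets of shared position labels, which is a simplification of visual inspection; the function names are hypothetical.

```python
from typing import Set


def band_differences(lane_a: Set[int], lane_b: Set[int]) -> int:
    """Count bands present in one lane but not in the other (both lanes)."""
    return len(lane_a ^ lane_b)


def relation(lane_a: Set[int], lane_b: Set[int], max_diff: int = 6) -> str:
    """Apply the exact-match (subtype) and up-to-six-differences (type) rule."""
    diff = band_differences(lane_a, lane_b)
    if diff == 0:
        return "same subtype"
    if diff <= max_diff:
        return "same type"
    return "different type"


print(relation({1, 2, 3}, {1, 2, 4}))  # two differences, one in each lane
```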
The type and subtype names were assigned one or more capital letters. The first pattern identified for a subtype in a type was assigned only a capital letter (e.g., A), and the remaining subtypes were named with capital letters and numbers (e.g., A2 and A3).
Gel analysis. A database of the PFGE patterns was created with Bionumerics software (version 3.0, Applied Maths, Ghent, Belgium). The gel photos were scanned and imported into a Bionumerics database as inverted 8-bit gray-scale TIF images. For each image, spectral analysis included in the software was used to determine the disk size that should be used in "rolling disk" background subtraction (background scale) and the cutoff threshold for least-squares filtering (Wiener cutoff scale). Furthermore, a median filter was used in the image to further smooth the densitometric curves.
After this image preprocessing, intergel and intragel normalizations of the PFGE runs were done with the S. pneumoniae R6 strain as a molecular marker. All the gels had three marker lanes: the second lane, a lane in the middle, and the lane before the last. Fifteen bands from 16,320 bp to 340,914 bp were used. The existence of these bands was verified, and their sizes were calculated by virtual digestion by using a Perl script to recognize the restriction sequence of SmaI (CCC↓GGG) in the GenBank file of the complete sequence (10). A cubic spline curve was used for the normalization and calibration of each gel. Strain R6 was obtained from the Rockefeller University culture collection.
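The virtual digestion step can be sketched as follows. This is not the authors' Perl script; it is a minimal Python illustration that assumes the genome is available as a plain nucleotide string and cuts at every CCCGGG site between the third C and the first G. Because the site is palindromic, scanning one strand is sufficient. Sites that span the origin of a circular sequence are not detected in this simplified version.

```python
from typing import List


def smai_fragments(seq: str, circular: bool = True) -> List[int]:
    """Return SmaI fragment sizes (bp), largest first."""
    site = "CCCGGG"
    cut_offset = 3  # SmaI cuts between CCC and GGG (blunt end)
    seq = seq.upper()

    # Collect the cut position of every recognition site.
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)

    if not cuts:
        return [len(seq)]

    # Fragments between consecutive cuts.
    frags = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # The piece after the last cut joins the piece before the first cut.
        frags.append(len(seq) - cuts[-1] + cuts[0])
    else:
        frags = [cuts[0]] + frags + [len(seq) - cuts[-1]]
    return sorted(frags, reverse=True)


# Toy 21-bp sequence with two sites.
print(smai_fragments("AAACCCGGGTTTTCCCGGGAA"))                  # [11, 10]
print(smai_fragments("AAACCCGGGTTTTCCCGGGAA", circular=False))  # [10, 6, 5]
```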
On all gel images, band assignment was manually curated after automatic band detection. This step is of paramount importance, since there are band intensity variations from gel to gel, which cause errors in the automatic band assignment. Bands ranging from 14 kbp to 400 kbp were considered in this study.
The software was then used to calculate the alternative band pattern similarity coefficients. For the 1,798 isolates used in this study, a comparison was created and the corresponding similarity matrices were exported by using the four band-based similarity coefficients (the Dice, Jaccard, Jeffrey's X, and Ochiai coefficients) and two curve-based correlation coefficients (the Pearson and Cosine coefficients). For the comparative evaluation of the different band-based coefficients, the optimization parameter was evaluated with band position tolerances ranging from 0% to 8%.
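To make the role of the band position tolerance concrete, the sketch below counts matching bands between two lanes. It is an assumption-laden illustration, not the Bionumerics algorithm: band positions are taken to be normalized to a 0 to 100 scale (percentage of the pattern length), the tolerance is expressed on the same scale, and bands are paired greedily one-to-one after sorting.

```python
from typing import Sequence


def count_matches(a: Sequence[float], b: Sequence[float],
                  tolerance_pct: float) -> int:
    """Count one-to-one band matches within the position tolerance."""
    a, b = sorted(a), sorted(b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance_pct:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] has no partner within tolerance
        else:
            j += 1  # b[j] has no partner within tolerance
    return matches


lane_a = [10.0, 25.0, 40.0]
lane_b = [10.5, 27.0, 40.0]
for tol in (0.0, 1.7, 2.5):
    print(tol, count_matches(lane_a, lane_b, tol))
```

In this toy example a wider tolerance turns near misses into matches, which is why the tolerance setting shifts the balance between false-negative and false-positive classifications.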
Band-based similarity coefficients. The four most popular band-based similarity coefficients were considered in this study for quantification of the similarities between PFGE band patterns: the Dice (7, 22), Jaccard (22), Jeffrey's X (18a), and Ochiai (18) coefficients (Table 1). All these coefficients exclude negative band matches, which is a necessary compromise, since all possible band positions are unknown.
TABLE 1. Band-based similarity coefficients between any two gel band patterns, i and j
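Since the body of Table 1 is not reproduced here, the following sketch gives the standard textbook definitions of three of the coefficients in terms of band counts. The symbols `n_i`, `n_j` (number of bands in each pattern) and `n_ij` (number of matching bands) are notation chosen for this example and may differ from the table. Jeffrey's X is deliberately omitted because its exact formulation should be taken from Table 1 rather than guessed.

```python
import math


def dice(n_i: int, n_j: int, n_ij: int) -> float:
    """Dice: twice the shared bands over the total number of bands."""
    return 2 * n_ij / (n_i + n_j)


def jaccard(n_i: int, n_j: int, n_ij: int) -> float:
    """Jaccard: shared bands over the number of distinct bands."""
    return n_ij / (n_i + n_j - n_ij)


def ochiai(n_i: int, n_j: int, n_ij: int) -> float:
    """Ochiai: shared bands over the geometric mean of the band counts."""
    return n_ij / math.sqrt(n_i * n_j)


# Two patterns with 10 and 12 bands, 9 of which match.
for name, f in (("Dice", dice), ("Jaccard", jaccard), ("Ochiai", ochiai)):
    print(name, round(f(10, 12, 9), 3))
```

None of these expressions uses the number of positions where both patterns lack a band, which is the exclusion of negative matches mentioned in the text.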
ROC curves. ROC curves were used to assess the classification by use of the different similarity coefficients. This method, which originated in signal detection theory, is frequently used in classification problems and is widely applied in medical diagnosis and psychometric analysis (8). It is commonly employed for the binary classification of continuous data, usually categorized as positive and negative cases. In our study, the correct classification was considered the VSG assignment; for each coefficient, the VSG assignment thus classified each case at each threshold as true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The classification accuracy of each coefficient was then measured by plotting, for the different threshold values, the sensitivity or true-positive rate (the number of true-positive classifications divided by the total number of positive cases, TP + FN) versus the false-positive rate, or 1 − specificity (Table 2); this plot is the ROC curve.
TABLE 2. ROC curve parameters
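The ROC construction described above can be sketched in a few lines. The input format is an assumption made for this example: each pair of isolates is reduced to a tuple of its similarity value and a flag saying whether the VSG assignment places both isolates in the same group. A pair is classified as positive when its similarity is at or above the threshold, and the area under the curve is approximated with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, bool]  # (similarity, same group according to VSG)


def roc_points(pairs: Sequence[Pair],
               thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (false-positive rate, true-positive rate, threshold) triples."""
    pos = sum(1 for _, same in pairs if same)
    neg = len(pairs) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, same in pairs if same and s >= t)
        fp = sum(1 for s, same in pairs if not same and s >= t)
        points.append((fp / neg, tp / pos, t))
    return points


def auc(points: Sequence[Tuple[float, float, float]]) -> float:
    """Trapezoidal area under the ROC curve, anchored at (0,0) and (1,1)."""
    pts = [(0.0, 0.0)] + sorted((fpr, tpr) for fpr, tpr, _ in points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy data: three same-group pairs and three different-group pairs.
pairs = [(0.95, True), (0.90, True), (0.85, False),
         (0.80, True), (0.60, False), (0.40, False)]
thresholds = [s for s, _ in pairs]
print(round(auc(roc_points(pairs, thresholds)), 3))  # 0.889
```

For the toy data the area equals 8/9, which matches the fraction of (same-group, different-group) pair combinations in which the same-group pair has the higher similarity.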
FIG. 1. Representation of VSG classification and band patterns for the 1,798 strains of S. pneumoniae. In the upper part (visual similarity group classification matrix), the black areas represent PFGE subtypes and the gray areas represent PFGE types. The most represented groups (PFGE types) are (point 1) A (67 isolates), (point 2) AO (65 isolates), (point 3) B (292 isolates), (point 4) DDD (51 isolates), (point 5) E (187 isolates), (point 6) FF (238 isolates), (point 7) M (131 isolates), (point 8) MM (107 isolates), (point 9) R (57 isolates), and (point 10) SI (47 isolates). The lower part of the figure includes the corresponding PFGE band patterns. The lines were drawn to help the reader isolate the PFGE patterns visually.
For example, in Fig. 2, ROC curves are plotted comparing the visual type assignments with the Dice coefficient values obtained at different band position tolerance settings. This illustrates how the best tolerance value for the Dice coefficient can be determined. The table inset in Fig. 2 provides the corresponding AUC values. The Pearson correlation coefficient AUC value is also included to illustrate the relatively inefficient classification achieved by correlation similarity coefficients (AUC of 0.901 versus an AUC of up to 0.984 for the Dice coefficient). Even worse performance was found for the Cosine correlation coefficient, with an AUC value of 0.882 (not plotted).
FIG. 2. ROC curves for several band position tolerances of the Dice coefficient in type classification. The maximum AUC value, 0.984, was found for a band position tolerance of 1.7%. The random classification (straight diagonal; AUC, 0.5) and the underperforming Pearson's correlation coefficient (AUC, 0.901) are plotted for reference.
FIG. 3. Area under the curve of ROC curves of the coefficients tested for different band position tolerances for subtype (A) and type (B) classification. Contribution of false-positive and false-negative classifications for the total classification error in subtype (C) and type (D). The Dice coefficient is identified by squares, the Jaccard coefficient is identified by diamonds, the Ochiai coefficient is identified by asterisks, the Jeffrey's X coefficient is identified by circles, the Pearson coefficient is identified by a dotted line without markers, and the Cosine coefficient is identified by a solid line without markers. For panels C and D, FP classifications are represented by gray dotted lines, and FN classifications are represented by black solid lines.
Although the different band-based similarity coefficients are surprisingly equivalent regarding the goodness of classification, the proportions of true-positive and false-positive subtype classifications differ. Figure 3C and D represent the contributions of false-positive and false-negative classifications to the total classification error.
For each band position tolerance, the point where the similarity coefficient threshold had a minimum absolute classification error (a minimum of false-positive plus false-negative classifications) was plotted. For example, by using the Dice coefficient for type classification, the similarity threshold value with minimal classification error was found to be 81% for a 1.7% band position tolerance (Fig. 4).
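The search for the threshold with the minimum absolute classification error can be sketched as follows, under the same assumed input format as a list of (similarity, same-group) pairs; the function name is hypothetical. When several thresholds tie, this version keeps the first one encountered.

```python
from typing import Sequence, Tuple

Pair = Tuple[float, bool]  # (similarity, same group according to VSG)


def min_error_threshold(pairs: Sequence[Pair],
                        thresholds: Sequence[float]) -> Tuple[float, int, int, int]:
    """Return (threshold, FP + FN, FP, FN) minimizing the misclassifications."""
    best = None
    for t in thresholds:
        fp = sum(1 for s, same in pairs if not same and s >= t)
        fn = sum(1 for s, same in pairs if same and s < t)
        if best is None or fp + fn < best[1]:
            best = (t, fp + fn, fp, fn)
    return best


pairs = [(0.95, True), (0.90, True), (0.85, False),
         (0.80, True), (0.60, False), (0.40, False)]
print(min_error_threshold(pairs, [0.95, 0.90, 0.85, 0.80, 0.60, 0.40]))
```

Returning FP and FN separately makes it possible to see how each error type contributes to the total, as plotted in Fig. 3C and D.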
FIG. 4. ROC curves and threshold representation for subtype (A) and type (B). This figure allows the choice of a threshold value as a function of the false-positive rate/true-positive rate, for the optimal band position tolerance settings that provide a maximum discrimination between types. Note that the false-positive rate (which corresponds to 1 − specificity) is represented on a logarithmic scale.
Regarding the type classification, band-based similarity coefficients also performed equally well (Fig. 3B), but the heterogeneity of band patterns included in each type is reflected by the persistence of false-negative classifications for wider band position tolerance values. At a band position tolerance of 1.7%, the four band-based similarity coefficients are nearly indistinguishable in terms of the contribution of false-positive and false-negative classifications to the type classification error.
These calculated optimal position tolerance settings apply only to the data analyzed in this study, although they are a very good starting point for data obtained by the same PFGE protocol, since the running conditions should be similar and should generate similarly resolved band patterns.
As suggested by the results plotted in Fig. 3, the fact that the four band-based similarity coefficients performed equally well for the same band position tolerance implies that there are equivalent, but not necessarily similar, threshold values between each of the band pattern similarity measures. This equivalence is confirmed in Fig. 4, where, for optimal band tolerance (1.7% for type; 2.5% for subtype), the ROC curves and corresponding threshold values are displayed. Figure 4, as discussed in the next section, can be used to determine the appropriate threshold values for the desired proportion of false-positive and false-negative classifications in the total classification error. Figure 4 can be analyzed to produce optimal threshold values for arbitrary cost-benefit ratios. For example, if FP and FN classifications are equally undesirable, the four band-based similarity coefficients should be used with the band identity tolerance values indicated in Table 3.
TABLE 3. Threshold similarity values for the point where there are a minimum of misclassifications (minimum of false positives and false negatives) of subtype and type
As expected, discrete band-based similarity coefficients clearly outperformed the correlation coefficients, leading to a much higher goodness of classification, as assessed by the area under the ROC curve. Surprisingly, all of the band-based similarity coefficients tested were found to be equally discriminant for both type and subtype (Fig. 3). That is, all four band-based similarity coefficient formulations (Table 1) will produce the same percentage of erroneous classifications for a given band identity tolerance value (Fig. 3A and B). However, this does not necessarily imply that the erroneous classifications will include the same proportion of false-negative and false-positive classifications.
As noted above, the results presented in Fig. 3 for the dependence of goodness of classification, as assessed by the corresponding AUC value, suggest not only that the four band-based methods will perform equally well but also that they will perform equally well for the same band tolerance values (Fig. 3A and B). This observation was valid for both type and subtype classifications. However, inspection of the corresponding proportions of FP and FN classifications (Fig. 3C and D) shows that, for subtype classifications (Fig. 3C), the Dice and the Jaccard coefficients will yield comparatively more FP classifications and fewer FN classifications than Jeffrey's X or the Ochiai coefficient. This distinction is the most pronounced when the goodness of classification (AUC) is maximal. It is also interesting that for subtype classification with exaggerated band identity tolerance values, the erroneous classifications will be heavily dominated by false-positive classifications. In contrast, neither of these observations is valid for type classification (Fig. 3D), where the proportion of false-positive and false-negative classifications is not noticeably different between the band-based methods assessed, and high band tolerance values do not cause false-positive classifications to predominate. It is also noteworthy that the proportions themselves (Fig. 3D) are somewhat erratic, which is a reflection of the fact that any two band patterns classified as the same type can have up to six band differences, allowing for a great heterogeneity of patterns.
The discussion above highlights the observation that if bands that discriminate between types are in close proximity to each other and are possibly bands of lower molecular size (from approximately 19 kbp to 100 kbp), misclassification will eventually occur as more subtypes are identified. This heterogeneity of patterns for isolates of the same type and the blurring of arbitrary type distinctions by new isolates justify why the number of false-negative classification contributions did not decrease for higher band identity tolerance values. This is in sharp contrast to what happens in subtype classification (Fig. 3C), where band patterns should be exactly equal (band-based similarity coefficient value of 100%) between strains of the same subtype. In practice, experimental conditions can cause small distortions in the gel that are not compensated for by software or visual classification, and that is why the determination of the optimal band identity tolerance is critical for automation of band pattern classification of subtypes. Accordingly, the similarity levels for strains of the same subtype oscillate in the 95 to 100% interval, even after selection of an optimal band position tolerance setting. These results suggest that while automation of the classification of PFGE band patterns of visually recognized subtypes and types is achieved with considerable accuracy by the proposed method (maximum AUC values of 0.9954 and 0.9837, respectively, for the Dice coefficient), visual assignment mostly delimits arbitrary groupings of subtype patterns. On the contrary, the automated classification of PFGE band patterns of subtypes confines defined groups where the band positions oscillate only very slightly around a reference value.
The immediately useful result of this paper is delivered in Fig. 4. It plots, for the optimal band position tolerance value, the logarithm of the false-positive rate versus the true-positive rate and the respective threshold values. The logarithm of the false-positive rate provides easier reading of the values for the lower false-positive rates (from 0.001 to 0.1). Figure 4 allows the choice of a threshold value as a function of the false-positive rate/true-positive rate for the optimal band position tolerance settings that provide a maximum discrimination (measured by the AUC). This choice weighs the relative cost of having a false-positive or a false-negative classification. For example, if the objective of a study is to recognize membership in a specific PFGE type, the threshold should be chosen to minimize the number of false-negative assignments. If the Dice similarity coefficient were the metric chosen and the acceptable false-positive rate were only 1%, then the threshold obtained by inspecting Fig. 4 would be about 80%. If, instead, the goal were the maximization of the true discovery rate, then the appropriate threshold for the same method would be 97% (this result is also listed in Table 3). This exercise also illustrates the conclusion that although the similarity coefficients perform equally well, they are not interchangeable, as different proportions of false-positive and false-negative classifications may result. Conversely, Fig. 4 can also be used to determine what threshold values will render two similarity coefficients equally discriminant for the optimal band position tolerance value.
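Reading a threshold off Fig. 4 for a given acceptable false-positive rate can be mimicked numerically. The sketch below assumes the same (similarity, same-group) pair representation used for the ROC calculation and returns the lowest candidate threshold whose false-positive rate stays within the limit, which is also the one with the highest true-positive rate under that constraint. The function name and the toy data are illustrative only.

```python
from typing import Optional, Sequence, Tuple

Pair = Tuple[float, bool]  # (similarity, same group according to VSG)


def threshold_for_max_fpr(pairs: Sequence[Pair], thresholds: Sequence[float],
                          max_fpr: float) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for the lowest threshold with FPR <= max_fpr."""
    pos = sum(1 for _, same in pairs if same)
    neg = len(pairs) - pos
    best = None
    # Lowering the threshold can only add positives, so FPR never decreases.
    for t in sorted(thresholds, reverse=True):
        fp = sum(1 for s, same in pairs if not same and s >= t)
        tp = sum(1 for s, same in pairs if same and s >= t)
        fpr = fp / neg
        if fpr > max_fpr:
            break
        best = (t, fpr, tp / pos)
    return best


pairs = [(0.95, True), (0.90, True), (0.85, False),
         (0.80, True), (0.60, False), (0.40, False)]
thresholds = [s for s, _ in pairs]
print(threshold_for_max_fpr(pairs, thresholds, max_fpr=0.0))
```

The function returns None when even the strictest candidate threshold exceeds the allowed false-positive rate.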
Over the past few years large databases of genotyped clinical strains have been assembled. These repositories contain a unique record documenting both the diversity and the dynamics of the emergence of new strains. Furthermore, it has been consistently shown that PFGE has a higher discriminatory power than newer sequence-based methods, such as multilocus sequence typing, which justifies the prospect that the cost-effective use of PFGE will be seamlessly integrated with other genotyping methods in even larger repositories. In that regard, the study reported here leads to the following conclusions.
First, we have found that the perception that band-based similarity coefficients are superior to correlation methods is correct, provided that they are correctly parameterized. This observation puts a premium not only on the correct parameterization method but also on the use of robust image analysis software for gel lane alignment and band recognition.
Second, we have used a repository of 1,798 PFGE-typed isolates of S. pneumoniae to assess the relative merits of the different band-based similarity coefficients: the Dice, Jaccard, Jeffrey's X, and Ochiai coefficients. Surprisingly, they were all found to be equally able to classify the isolates from the reference database, with equivalent performances occurring for distinct thresholds but the same band position tolerances. The goodness of classification was assessed by use of the AUC of the ROC curve.
Third, the equivalence in AUC with the same proportion of erroneous classifications was found to correspond to different proportions of false-positive and false-negative classifications, which will play a role in the selection of a similarity coefficient for use in a fully automated bioinformatic implementation. Consequently, the assessment and parameterization of PFGE similarity coefficients are delivered as ROC curve plots with the corresponding threshold values (Fig. 4), where the cost-benefit assigned to the different types of erroneous classifications can be weighted quantitatively and the most appropriate method and threshold values can be selected.
Fourth, the automated procedure was found to perform satisfactorily, with an optimal AUC of 0.984. This result supports the conclusion that the implementation of automated classification is highly advantageous, particularly since multiparametric statistics can be associated to select those patterns that warrant subsequent visual inspection.
The optimal parameterization of band-based similarity coefficients opens the prospect of revisiting the identification of types as a dynamic entity defined by unsupervised classification algorithms such as nearest means (K means) or self-organized maps. Therefore, the identification of similarity metrics that reproduce and automate the classification of typing results enables the redefinition of heterogeneous types in S. pneumoniae with time-dependent identities that converge to the confinements of the natural populations as more isolates are characterized. The tracking of how the definitions evolve could be solved automatically by the implementation of repositories that can be queried by use of the shortest similarity coefficient value. The methods used in this paper can be used in any database to determine which similarity metric is more adequate to describe the data and also which parameters optimize the classification procedure.
Partial support for this work was provided by contracts EURIS (QLK2-CT-2000-01020) and PREVIS (LSHM-CT-2003-503413 from the European Community) awarded to H. de Lencastre and J. S. Almeida. J. A. Carriço and F. R. Pinto were supported by grants SFRH/BD/3123/2000 and SFRH/BD/6488/2001, respectively, both from the Fundação para a Ciência e Tecnologia of Portugal. S. Nunes and N. G. Sousa were supported by grants 011/BIC/01 and 043/BIC/00, respectively, from contract QLK2-CT-2000-01020; S. Nunes, N. G. Sousa, and N. Frazão have been supported since March 2004 by grants 010/BIC/2004, 009/BIC/2004, and 011/BIC/2004, respectively, from contract LSHM-CT-2003-503413. C. Simas was supported by a grant from IBET, project WLP (grant 31 CEM/NET); and N. Frazão was also supported by IBET grant 28/12/02.