Journal of Clinical Microbiology, September 2008, p. 2868-2873, Vol. 46, No. 9
0095-1137/08/$08.00+0 doi:10.1128/JCM.01000-08
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

Department of Basic Science and Craniofacial Biology,1 Department of Cariology and Comprehensive Care,2 Department of Epidemiology and Health Promotion, College of Dentistry,3 School of Medicine, New York University, New York, New York 100104
Received 23 May 2008/ Returned for modification 6 June 2008/ Accepted 22 June 2008
In our previous study, we demonstrated that strains of S. mutans associated with S-ECC differ in their genomic composition compared to caries-free (CF) controls (42). Using suppressive subtractive DNA hybridization (SSH), several unique gene segments were identified from a strain of S. mutans (AF199) that was isolated from a child with S-ECC. The presence of unique genetic loci among S. mutans strains is consistent with the recent work by Waterhouse and Russell (51), who described the presence of "dispensable genes" distributed among strains of S. mutans. These segments include mobile genetic elements that are widely distributed in S. mutans (2) and have been shown to modulate sucrose (31) and melibiose (40) metabolism. S. mutans strains also vary in the presence of plasmids (10, 32); mutacin I, II, III, and IV operons (3, 19, 35-37); serotypic antigens (43); competence (34); the comBCD genes (28); and gtfBC (14, 48, 52), among other genetic loci. Based on the wide diversity of genotypes and genetic loci within S. mutans, different strains apparently comprise both common and unique genetic loci, and it seems that these differences are unequally distributed among strains (42, 51). Identifying the unique DNA fragments that are common to most of the strains isolated from S-ECC but not CF children will be important even if their function is unknown because the nucleotide sequences can serve as diagnostic biomarkers for DNA-based detection arrays (38, 41).
Here we report the identification of a hierarchical series of gene biomarkers derived via SSH from strains of S. mutans associated with S-ECC. These biomarkers were then evaluated by machine learning techniques for their ability to classify clinical isolates of S. mutans into one of two categories, CF or S-ECC. Our findings suggest that as few as three SSH biomarkers were sufficient to accomplish this goal.
Bacterial sample procedures and processing. Bacterial samples from saliva and pooled plaque of the S-ECC and CF children were collected before any dental treatment was initiated. The supragingival plaque samples were collected and processed as previously described (10, 26). The genomic DNA of pure cultures of S. mutans isolates was obtained by using a genomic DNA purification kit (Qiagen, Hilden, Germany), as previously described (25, 27). All of the DNA samples from S. mutans were first subjected to chromosomal DNA fingerprinting (9, 24) to identify the genotypes of the isolates and were then used for subtractive hybridization.
SSH. SSH was used to isolate DNA fragments present in the strains of S. mutans isolated from plaque samples from S-ECC but not present in the S. mutans of the CF plaque sample. The protocol has been described elsewhere (42). Briefly, the DNA of S. mutans strains isolated from S-ECC subjects, selected as tester strains, was subtracted against the pooled genomic DNA of strains of S. mutans from CF subjects. DNA samples from CF children were mixed (2 µg of each), and 2 µg from this DNA mix was used as driver against 2 µg of each tester DNA. Six subtraction reactions were performed, using the PCR-Select bacterial genome subtraction kit (BD Biosciences, San Jose, CA) with minor modifications (42). Tester and driver samples were digested separately with RsaI. Amplification of tester-specific fragments was performed by PCR with primers directed at tester-ligated adaptor sequences, following the protocol provided by the manufacturer (BD Biosciences). Secondary PCR products (4 µl) were cloned into the pCR4-TOPO vector (Invitrogen) and transformed into Escherichia coli TOP10 cells (Invitrogen). A total of 1,300 transformants were picked at random and grown in 96-deep-well plates at 37°C in 1.5 ml of Luria-Bertani medium with kanamycin for 12 h. The plasmid DNA was extracted by using a 96 Turbo BioRobot Kit and a BioRobot 3000 (Qiagen). False-positive results (SSH fragments present in both S-ECC and CF strains) were identified by dot blot hybridization. Purified plasmid DNA containing cloned SSH fragments was sequenced using the M13 universal primer. Sequencing reactions were performed in both directions on an ABI model 377 DNA sequencer. Nucleic acid and predicted protein compositions were compared to those archived in GenBank using BLAST (National Center for Biotechnology Information [NCBI]). Sequences were also analyzed for protein coding regions via the open reading frame finder (NCBI; http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and PFAM (47).
Dot blot hybridization. Dot blots were prepared by using standard procedures (42). PCR products from each of the 1,300 selected transformants were purified by using a PCR purification kit (Qiagen). Portions (10 µl) of the PCR products were diluted with 40 µl of 1 M NaOH, 5 µl of 200 mM EDTA, and 45 µl of sterile water. Diluted PCR products were denatured and spotted directly onto Hybond-N+ nylon membranes (Ambion), using a 96-well manifold (Gibco-BRL). Membranes were UV cross-linked by using Stratalinker (Stratagene) and stored dry before hybridization. Membranes were prehybridized for 15 min in a hybridization oven with 7 ml of warm (68°C) UltraHyb buffer (Ambion). Biotin-labeled probes were made with 100 ng of purified, RsaI-digested driver or tester genomic DNA. Hybridization was carried out for 16 h at 68°C in a rotating hybridization oven. Standard protocols for membrane washing (42) were followed, washing twice under moderate to high-stringency conditions (50 ml of 0.2x SSC [1x SSC is 0.15 M NaCl plus 0.015 M sodium citrate] plus 0.1% sodium dodecyl sulfate, 15 min, 42 to 65°C). Hybrids were detected with a BrightStar detection kit (Ambion) after exposure to BioMax X-ray films (Kodak).
PCR amplification. To determine the distribution of putative unique S-ECC-associated sequences among S. mutans, PCR primers and conditions were designed for each of the selected SSH biomarkers for screening other S-ECC and CF genotypes. In addition, each SSH fragment was analyzed for G+C content and then used to query BLAST and protein databases. Standard PCRs were carried out with either S-ECC or CF S. mutans genomic DNA. Typically, a 50-µl PCR included 2.5 µl of 10x PCR buffer (100 mM Tris-HCl [pH 9.0], 15 mM MgCl2, 500 mM KCl), 0.25 µl of 20 mM deoxynucleoside triphosphates, 1 µl of each of the forward and reverse primers (stock concentration, 50 nM), 0.5 µl (5 U) of Taq DNA polymerase (Invitrogen), and 2 µl of template DNA. PCR conditions were as follows: 94°C for 3 min; 94°C for 45 s, 54 to 59°C for 45 s, and 72°C for 60 s for 30 cycles; and 72°C for 7 min. Corresponding tester and driver strains were used as positive and negative controls, respectively. Amplicons were analyzed by electrophoresis on 1.5% agarose gels.
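The G+C content analysis mentioned above reduces to a simple base count. The following is a minimal sketch, not the authors' software; the function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C fraction (0.0 to 1.0) of a DNA sequence.

    Assumption for this sketch: only unambiguous bases (A, C, G, T)
    are counted; anything else (for example N) is ignored.
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc_bases = sum(1 for base in counted if base in "GC")
    return gc_bases / len(counted)


if __name__ == "__main__":
    # Hypothetical fragment, not one of the deposited sequences.
    print(f"{gc_content('ATGCGCTTAANN'):.2f}")
```

A fragment whose G+C content differs markedly from the host genome average is often taken as a hint of horizontal acquisition, which is one reason this value is commonly reported for subtracted fragments.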
AI. Two independent forms of artificial intelligence (AI), support vector machine (SVM) and neural network analyses, were used to compare and calculate the sensitivity, specificity, and overall accuracy of each selected SSH biomarker.
SVM. The presence or absence of 19 PCR-amplified SSH biomarkers (derived from the original 1,300 clones, minus false-positives, obtained by SSH from S-ECC strains versus pooled strains from CF children) was assessed in each of a total of 49 clinically isolated S. mutans strains: S-ECC (n = 26) and CF (n = 23). Amplification of each SSH biomarker in each strain was scored as present ("1") or absent ("0"). SVM was used as a supervised learning method to identify an optimal combination of the S-ECC biomarkers that could correctly classify clinical isolates of S. mutans into one of two categories: S-ECC associated or CF associated. The 49 S. mutans strains were randomly assigned to a training set (60%) or a test set (40%). The SVM classifier program (WEKA, Sequential Minimal Optimization [http://www.cs.waikato.ac.nz/ml/weka/]) was then run on the training set, resulting in a model that defined the S-ECC strains. Due to the limited size of each data set, cross-validation within the original data set was utilized to provide a nearly unbiased estimate of classification performance. For each classification, the true-positive, true-negative, false-positive, and false-negative values were obtained, from which accuracy, sensitivity, and specificity were calculated by using Health Decision Strategies EpiMax software (15).
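The accuracy, sensitivity, and specificity described above follow directly from the four confusion-matrix counts. The sketch below shows the standard definitions only; it is not the EpiMax software, the function name is hypothetical, and the counts in the usage example are invented for illustration and are not results from this study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    Here S-ECC is treated as the positive class (an assumption of this sketch):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      accuracy    = (TP + TN) / (TP + TN + FP + FN)
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be represented")
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
    }


if __name__ == "__main__":
    # Hypothetical counts for 26 positive and 23 negative isolates.
    for name, value in classification_metrics(tp=24, tn=21, fp=2, fn=2).items():
        print(f"{name}: {value:.3f}")
```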
The SVM analysis was extended by reducing the number of features (SSH biomarkers) used to build the classifier (attribute and/or feature selection). Using the WEKA software, the markers were ranked by information gain; the top 10 and then the top 5 markers were used to train the classifier, cross-validate it, and classify the test set.
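Ranking by information gain can be illustrated for binary presence/absence markers as follows. This is a minimal sketch of the general technique, not the WEKA implementation; the function names and the toy marker names (`marker_a`, `marker_b`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(calls, labels):
    """Reduction in class entropy after splitting on a 0/1 marker."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(calls):
        subset = [lab for call, lab in zip(calls, labels) if call == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def rank_markers(markers, labels):
    """Return marker names sorted from most to least informative.

    `markers` maps a marker name to its list of 0/1 calls, one per strain,
    in the same order as `labels`.
    """
    return sorted(markers,
                  key=lambda name: information_gain(markers[name], labels),
                  reverse=True)


if __name__ == "__main__":
    # Toy data only: four strains, two hypothetical markers.
    labels = ["S-ECC", "S-ECC", "CF", "CF"]
    markers = {"marker_a": [1, 1, 0, 0], "marker_b": [1, 0, 1, 0]}
    print(rank_markers(markers, labels))
```

Selecting the top 10 or top 5 markers then amounts to slicing the ranked list before training.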
Neural network and recursive partitioning tree. A feed-forward, back-propagation neural network with five input nodes, 20 hidden nodes, and two output nodes was also used to assess the discriminatory capability of the SSH biomarkers. The five input nodes corresponded to SSH biomarkers 0018, H7, 0102, 0006, and 0004, which were identified by SVM analysis and the classification tree (see below). A total of 50% of the S-ECC- and CF-associated strains of S. mutans were randomly selected as a training set, and the remaining cases formed the test set. The process of selecting the training set, training the net, and testing was repeated 1,000 times, and the mean sensitivity and specificity and their empirical 95% confidence intervals were calculated.
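The resampling scheme described above (repeated random 50% splits, followed by an empirical 95% interval over the repeated estimates) can be sketched without the network itself. The helper names below are hypothetical, and the nearest-rank percentile rule is an assumption; the source does not state how the empirical interval was computed.

```python
import random


def random_half_split(items, rng):
    """Shuffle a copy of `items` and split it into training and test halves."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]


def empirical_interval(values, level=0.95):
    """Return the empirical (percentile) interval of repeated estimates.

    Assumption: a nearest-rank rule on the sorted values is used to pick
    the lower and upper percentiles.
    """
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed so the sketch is repeatable
    train, test = random_half_split(range(49), rng)
    print(len(train), len(test))
    # Stand-in for 1,000 sensitivity estimates from repeated train/test runs.
    fake_estimates = [rng.uniform(0.8, 1.0) for _ in range(1000)]
    print(empirical_interval(fake_estimates))
```

In a full analysis, each repetition would train the classifier on the training half, score the test half, and append the resulting sensitivity or specificity to the list passed to `empirical_interval`.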
Finally, a recursive partitioning tree was constructed from the pool of all SSH fragments and pruned to three levels. At each level, a binary decision was applied based on the presence or absence of the (automatically) selected fragment. Neural net analysis and recursive partitioning tree determinations were performed in R version 2.6 (www.R-project.org).
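The idea of a depth-limited tree built from presence/absence calls can be sketched as below. This is an illustration only and not the R analysis: it assumes Gini impurity as the splitting criterion and limits depth during growth rather than pruning afterward, neither of which is stated in the text. All names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(markers, labels):
    """Pick the 0/1 marker whose split gives the lowest weighted impurity."""
    best_name, best_impurity = None, gini(labels)
    for name, calls in markers.items():
        present = [lab for c, lab in zip(calls, labels) if c == 1]
        absent = [lab for c, lab in zip(calls, labels) if c == 0]
        if not present or not absent:
            continue  # marker does not divide this node
        weighted = (len(present) * gini(present)
                    + len(absent) * gini(absent)) / len(labels)
        if weighted < best_impurity:
            best_name, best_impurity = name, weighted
    return best_name, best_impurity


def build_tree(markers, labels, depth=0, max_depth=3):
    """Grow a binary tree of at most `max_depth` decision levels."""
    # Majority label of the node; sorting makes ties deterministic.
    majority = max(sorted(set(labels)), key=labels.count)
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    name, _ = best_split(markers, labels)
    if name is None:
        return majority

    def branch(value):
        keep = [i for i, c in enumerate(markers[name]) if c == value]
        sub_markers = {m: [calls[i] for i in keep] for m, calls in markers.items()}
        return build_tree(sub_markers, [labels[i] for i in keep],
                          depth + 1, max_depth)

    return {"marker": name, "present": branch(1), "absent": branch(0)}


if __name__ == "__main__":
    labels = ["S-ECC", "S-ECC", "CF", "CF"]           # toy data
    markers = {"frag_a": [1, 1, 0, 0], "frag_b": [1, 0, 1, 0]}
    print(build_tree(markers, labels))
```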
Nucleotide sequence accession numbers. The nucleotide sequences unique to cariogenic strains of S. mutans were deposited in GenBank under accession numbers EU918292 to EU918301.
FIG. 1. Overall study design for subtractive DNA hybridization, attributes/biomarker selection, and classification analysis.
TABLE 1. Characterization of S-ECC S. mutans-specific DNA biomarkers obtained from SSH libraries
The SVM classifier algorithm generated a model capable of differentiating between S-ECC and CF strains. For each category (S-ECC or CF), the numbers of true-positive, true-negative, false-positive, and false-negative results were calculated and then used to estimate the overall accuracy, sensitivity, and specificity of the various models (Tables 2 and 3). The resulting classifier correctly partitioned strains into either S-ECC- or CF-associated groups with an accuracy of 90% and a sensitivity and specificity of 89 and 90%, respectively (Table 3). The classifier was run again using only the five most informative biomarkers (0018, H7, 0006, 0007, and 0004). With just five biomarkers, the accuracy of classifying strains improved to 92% (Table 2). Biomarker 0018, which was similar (E value of 8e-32) to a hypothetical protein from Staphylococcus haemolyticus, was the most informative of the five and was present in most of the S-ECC S. mutans strains. These five biomarkers were present in 90 to 100% of the 26 S-ECC isolates tested, suggesting that these genes play a functional role in the pathogenic potential of S. mutans.
TABLE 2. Summary of stratified cross-validation analysis of biomarkers
TABLE 3. Accuracy of biomarkers by class (S-ECC or CF)
FIG. 2. Recursive partitioning classification of fragments to optimally discriminate caries status. Terminal nodes are shown as squares, and nonterminal nodes are shown as circles. Each node is labeled as either S-ECC or CF depending on the simple majority of cases within the node. Decision rules are shown on the lines connecting the nodes. This analysis shows that fragments 0018, H7, and 0006 used in a decision tree result in a sensitivity of 96% for S-ECC status and 91% for CF status.
That individual strains of S. mutans differ in their genetic composition has been demonstrated in a number of studies (10, 28, 35, 37, 43, 53), and the variation may be as much as 20%, comprising the "dispensable" genome (51). For example, the com genes that mediate quorum sensing and genetic competence vary in distribution and genetic composition (4, 50). It may not be a coincidence that all of the loci described above have been linked directly or indirectly with the virulence of S. mutans, and all exhibit variation. Some strains of S. mutans harbor 5.6-kb cryptic plasmids, and there are sufficient polymorphisms at the nucleotide level to allow phylogenetic ordering of plasmid-containing strains of S. mutans, giving insight into the evolutionary history of its human host (10).
Genetic variation among strains within a given species is not uncommon (for a review, see reference 1). Escherichia coli, for example, varies in intraspecies genetic composition among natural isolates by as much as 20%; this is not surprising given its wide host range (44). Strains of Helicobacter pylori vary (29), with in silico comparisons between genomes showing differences of ca. 7%. Comparison of genomes of Staphylococcus aureus showed that strains are "peppered" with mobile elements and contain large blocks of genes in pathogenicity islands, with 6% of the genome being strain specific (13, 16, 20). Comparisons between strains of S. pyogenes (12) and between strains of S. pneumoniae (7) showed intraspecies differences of ca. 10%. In both of these close relatives of S. mutans, differences are manifest to a large extent in the presence or absence of large blocks of genes. In many medically important bacteria, strain-specific genes reside in large chromosomal regions called genomic or pathogenicity islands (21-23). S. mutans UA159 contains at least 11 genomic islands (Los Alamos Oralgene site [http://www.oralgen.lanl.gov/]), but the distribution of these and other putative genomic islands among different strains of S. mutans remains unknown. Some of these genomic islands may be directly associated with the expression of virulence.
A novel aspect of the present study was the use of AI learning algorithms that take the presence or absence of SSH fragments as input to classify each S. mutans isolate by caries state (S-ECC or CF). Recent literature strongly suggests that AI approaches to classification outperform "classical" statistical methods (50). This method provides a scalable solution that can expand to incorporate multiple data types and large numbers of samples. AI methods such as SVM are very commonly used in disease prediction and pattern recognition in microarray data analysis, especially for cancer prediction. SVM algorithms have been successfully applied to bacterial proteins (17, 18, 30, 39), metabolites (11), and pattern recognition, yielding >90% accuracy. In the present study the results from two independent forms of AI, SVM and neural network analyses, were compared to the true status of each sample to calculate the sensitivity, specificity, and overall accuracy of the output. Exact binomial tests of independent proportions were used to identify fragments that exhibited maximal differentiation of S-ECC and CF. To account for the many comparisons, we used adjusted P values to control the false discovery rate (45). The number of SSH fragments needed to accurately classify strains can be reduced by considering a two-stage hierarchical classification procedure. As seen in Table 2, this can be achieved with high accuracy (92.0%) with only five biomarkers using a linear SVM. Table 3 shows the average prediction accuracy achieved on other pairwise discriminations, indicating that the CF versus S-ECC distinction can be made with high accuracy using just five SSH fragments. Our studies indicated that virulent clones possess the most important biomarkers, and most of the biomarkers identified are present in various strains. This finding is consistent with other reports that genetic variation among strains within a given species is common and that these genetic changes are associated with disease.
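Adjusting P values to control the false discovery rate across many fragment-wise tests can be illustrated as follows. The sketch assumes the Benjamini-Hochberg step-up procedure, which the text does not name; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    For the P value of rank r (1 = smallest) among m tests, the adjusted
    value is min over ranks k >= r of p_(k) * m / k, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so the minimum can be carried along.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw P values for four fragments.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw={p:.3f} adjusted={q:.3f}")
```

A fragment would then be flagged as differentiating S-ECC from CF when its adjusted value falls below the chosen false discovery rate threshold.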
Recently, McMillan et al. (33) reported that the reemergence of severe, invasive group A streptococcal diseases could be caused by altered genetic endowment in these organisms. Using a similar neural network approach, they identified three genes with a marginal overrepresentation in invasive disease isolates. Significantly, two of these genes, ssa and mf4, encoded superantigens but were only present in a restricted set of group A streptococcal M types. The third gene, spa, was found in variable distributions in all M types in the study (33). Using a similar approach, we identified a small set of SSH fragments that can be used to computationally "predict" S-ECC and CF S. mutans with high accuracy. In addition to SVM, we also applied artificial neural network algorithms to determine the robustness of fragments in classifying strains into S-ECC or CF. The recursive partitioning tree confirms that just three of these fragments (0018, H7, and 0006) can produce a classification accuracy of 94%. Thus, our methodology identifies an optimum combination of genes that may have the greatest effect on the characteristics of S. mutans.
These data demonstrate that DNA biomarkers obtained from SSH can be used to classify strains of S. mutans into S-ECC and CF groups. Further independent validation studies with larger sample sizes are warranted to evaluate the true potential of these biomarkers. Even though these types of analyses do not tell us what function or role particular genetic loci play in health or disease, they do provide a panel of biomarkers that may be applicable to risk assessment. In addition, mapping of these fragments or biomarkers onto the chromosome may lead to the discovery of genomic islands or other horizontally acquired genetic loci that might contribute to the overall virulence of S. mutans, including those which have yet to be identified. Since S. mutans is largely transferred from mother to child, a chairside test might be devised from these biomarkers capable of indicating potential risk to a child based on their own or their mother's strains of S. mutans. If successful, such a test would have tremendous public health implications for identifying children at risk before they experience this devastating disease.
We thank Michele Savel and Hareeti R. Gill for assisting in collecting clinical samples and Liying Yang for technical support.
Published ahead of print on 2 July 2008.