ABSTRACT
Staphylococcus epidermidis is a ubiquitous colonizer of human skin and a common cause of medical device-associated infections. The extent to which the population genetic structure of S. epidermidis distinguishes commensal from pathogenic isolates is unclear. Previously, Bayesian clustering of 437 multilocus sequence types (STs) in the international database revealed a population structure of six genetic clusters (GCs) that may reflect the species' ecology. Here, we first verified the presence of six GCs, including two (GC3 and GC5) with significant admixture, in an updated database of 578 STs. Next, a single nucleotide polymorphism (SNP) assay was developed that accurately assigned 545 (94%) of 578 STs to GCs. Finally, the hypothesis that GCs could distinguish isolation sources was tested by SNP typing and GC assignment of 154 isolates from hospital patients with bacteremia and those with blood culture contaminants and from nonhospital carriage. GC5 was isolated almost exclusively from hospital sources. GC1 and GC6 were isolated from all sources but were overrepresented in isolates from nonhospital and infection sources, respectively. GC2, GC3, and GC4 were relatively rare in this collection. No association was detected between fdh-positive isolates (GC2 and GC4) and nonhospital sources. Using a machine learning algorithm, GCs predicted hospital and nonhospital sources with 80% accuracy and predicted infection and contaminant sources with 45% accuracy, which was comparable to the results seen with a combination of five genetic markers (icaA, IS256, sesD [bhp], mecA, and arginine catabolic mobile element [ACME]). Thus, analysis of population structure with subgenomic data shows the distinction of hospital and nonhospital sources and the near-inseparability of sources within a hospital.
INTRODUCTION
Staphylococcus epidermidis is a commensal of human skin and a common contaminant of clinical specimens, but it is also an important human pathogen (1, 2). Currently, the coagulase-negative staphylococci (CoNS), of which S. epidermidis is the species most commonly isolated from humans, rank as the number one cause of central line-associated bloodstream infections, the second-most-common cause of surgical site infections, and the third-most-common cause of all health care-associated infections reported to the National Healthcare Safety Network from 2009 to 2010 (3, 4). Uncertainty in the clinical interpretation of S. epidermidis blood cultures can delay or misguide diagnosis and treatment, increasing both morbidity and treatment costs (5, 6). The ideal of distinguishing “true” infection from specimen contamination has not yet been realized, and even the strictest definitions of S. epidermidis sepsis have been fraught with exceptions, false positives, and examples of polyclonal infection (7, 8).
The diagnosis of S. epidermidis infections could be aided by the identification of markers that accurately distinguish between infection and contaminant or commensal sources. Antimicrobial resistance and biofilm phenotypes as well as the genetic markers mecA, icaA, and IS256 have repeatedly been shown to be more common in hospital isolates than in nonhospital isolates, but these markers are not necessarily useful for distinguishing infection isolates from coresident hospital isolates that contaminate clinical specimens (9–13). Such markers may promote a hospital lifestyle and thus provide increased opportunities to cause infections. In contrast, the genetic markers fdh and arginine catabolic mobile element (ACME) have been reported to be more common in contaminant or commensal isolates than in true infection isolates (14–16).
The search for markers of pathogenicity has extended to studies of S. epidermidis population genetic structure. Multilocus sequence typing (MLST) has identified clones such as sequence type 2 (ST2) that are common in hospitals (15, 17–24). However, a robust classification of S. epidermidis STs into larger groups of related STs has been lacking (25). Recently, we used Bayesian clustering of the MLST data in the international database to identify a species-wide population structure of six genetic clusters (GCs) that may relate to bacterial lifestyle (26). Analysis of isolates from clinical specimens from a New York hospital showed that GC5 was common and enriched for hospital-associated markers such as antibiotic resistance, high biofilm production, icaA, IS256, and sesD (bhp), suggesting a lifestyle adapted to the hospital environment (26). GC1 and GC6 were also commonly isolated from clinical specimens but were not associated with the tested markers (except GC6 and sesF [aap]), suggesting a more generalist lifestyle. GC2 was rare from clinical specimens and positive for the putative commensal marker fdh. GC3 was also rarely isolated from clinical specimens, and it was identified as a cluster with a significant admixture of DNA from all other clusters (26). Results from a recent genomic analysis of diverse S. epidermidis isolates were consistent with this MLST classification; specifically, genomic group A included MLST groups GC5, GC1, and GC6 and was separated from genomic group B, which included MLST groups GC2 and GC4 (27). Recombination was most extensive in genomic group C, which included MLST group GC3 (27).
In this study, using a larger, updated MLST database, we verified that six GCs define the population genetic structure of S. epidermidis. We developed a SNP assay for accurately assigning isolates to GCs without the need for full MLST or genomic data. To test the hypothesis that GCs could distinguish isolation sources, we applied this system to three collections of S. epidermidis isolates representing “true” bacteremia, blood isolates considered to be contaminants, and nonhospital carriage isolates. We further characterized isolates for seven previously studied genetic markers and developed a machine learning algorithm to predict isolation sources with these data.
MATERIALS AND METHODS
Bacterial isolates. Isolates were collected at the OSF Saint Francis Medical Center in Peoria, Illinois, with the approval of the Peoria Institutional Review Board. Blood cultures were processed in the OSF System Laboratory using a Bactec blood culture system (Becton Dickinson). Several typical colonies were picked for identification and sensitivity testing, performed with a Vitek automated system (bioMérieux). The subcultures were then stored on slants. Isolates were recovered from slants in the Pediatric Research Laboratory, University of Illinois College of Medicine at Peoria, on tryptone soya 5% blood agar. Single representative colonies were picked by one physician-microbiologist (B. M. Gray). The predominant strain was selected by colony morphology from each of one to six separate blood cultures. Single-colony picks were also made for presumed contaminant strains.
The 154 isolates were derived from three sources.
(i) There were 59 isolates from 32 adult patients with “true” bacteremia, as determined from two positive blood cultures obtained within 24 h, having similar colony morphologies, plus evidence of infection confirmed by chart review. Two exceptions were a patient who had a single blood culture associated with an infected vascular graft and another with an associated skin infection. The selection of patient strains was intended to provide a set of isolates with high specificity for infection (7, 8). Seventeen of the infected patients also yielded 21 isolates deemed to be contaminants, recovered either from the same blood cultures as the predominant infecting strain or from separate ones.
(ii) There were 55 isolates considered to be contaminants: the 21 contaminant isolates from the infected patients just described and 34 isolates from 26 patients who had only a single positive blood culture and evidence against infection upon chart review. Results from these two sets of contaminants were analyzed separately and together and were combined for the final analyses described below. All bacteremia and contaminant blood culture isolates were collected from March 2013 through February 2014; patients ranged in age from 19 to >80 years; 51% were male.
(iii) There were 40 isolates from 23 nonhospital subjects who were fathers visiting their infants in the neonatal intensive care unit during August 2009 through January 2010; cultures were obtained from all but three fathers within 1 week of admission of their infants, usually at their first visit. Cultures of anterior nares were obtained with Dacron swabs; cultures of both hands were obtained using a bag and buffer method.
Isolates were stored and shipped in Dorset egg medium without antibiotics (28) to the University of Mississippi Medical Center. Isolates were coded, and genetic characterization was completed in a blind fashion. Isolates were cultured overnight at 37°C on tryptone soya agar or blood agar and were cryopreserved at −80°C in a solution of tryptic soy broth with 15% glycerol. DNA was extracted using a DNeasy blood and tissue kit (Qiagen) according to the manufacturer's instructions and using a solution of 1.5% lysostaphin and lysozyme during the initial incubation steps. Species identification of isolates was confirmed by sequencing both strands of a tuf gene fragment (29) and detecting >99% nucleotide identity to a reference sequence from S. epidermidis strain ATCC 12228. Characteristics of all study isolates are given in Data Set S1 in the supplemental material.
Bayesian clustering of MLST data. The international multilocus sequence typing (MLST) database for S. epidermidis (sepidermidis.mlst.net) consisted of 588 sequence types (STs) when downloaded on 4 September 2015. Ten STs with insertion-deletion polymorphisms in the tpiA gene fragment were excluded, leaving 578 STs for analysis. STs were assigned to genetic clusters (GCs) using the Bayesian clustering program BAPS v6 (30) with previously described methods (31). In brief, MLST loci were oriented and trimmed to the +1 reading frame and clustered with the codon linkage model. Upper bounds of 11 to 20 populations were considered, each evaluated five times. Admixture analysis based on mixture clustering of individuals used 100 iterations, 50 reference individuals per population, and 10 iterations per reference individual.
Identification of SNPs that distinguish genetic clusters. Seven single nucleotide polymorphisms (SNPs), comprising one SNP from each of the seven MLST gene fragments, were selected from the 578 STs to maximally differentiate GCs. SNP selection was guided by the GST statistic, which estimates the proportion of the between-GC diversity in the total diversity. GST was calculated using DnaSP v5.10 software (32).
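As an illustration of how a statistic of this kind ranks candidate SNPs, the sketch below computes Nei's G_ST (written GST in the text) for a single SNP as (H_T − H_S)/H_T, where H_S is the within-cluster gene diversity and H_T is the diversity of the pooled sample. This is a minimal sketch under stated assumptions: clusters are weighted by their number of STs, the function names and toy data are hypothetical, and the estimator implemented in DnaSP may differ in detail.

```python
from collections import Counter


def allele_freqs(alleles):
    """Relative frequency of each allele in a list of alleles."""
    counts = Counter(alleles)
    n = len(alleles)
    return {allele: count / n for allele, count in counts.items()}


def gene_diversity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def gst(alleles_by_cluster):
    """Nei's G_ST for one SNP.

    alleles_by_cluster maps a cluster name to the list of alleles
    (one per ST) observed at the SNP in that cluster.
    Assumption: clusters are weighted by their number of STs.
    """
    total = sum(len(a) for a in alleles_by_cluster.values())
    # H_S: size-weighted mean of the within-cluster diversities
    h_s = sum(
        len(a) / total * gene_diversity(allele_freqs(a))
        for a in alleles_by_cluster.values()
    )
    # H_T: diversity of the pooled sample
    pooled = [x for a in alleles_by_cluster.values() for x in a]
    h_t = gene_diversity(allele_freqs(pooled))
    # A monomorphic site carries no information about cluster membership
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical toy data: a SNP fixed for different alleles in two clusters
print(gst({"GC_A": ["C"] * 4, "GC_B": ["T"] * 4}))
```

A SNP whose alleles are fixed differently in each cluster scores 1, whereas a SNP with identical allele frequencies in every cluster scores 0, so higher values mark SNPs that better separate the clusters.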
Assignment of SNP types to genetic clusters. SNP types were assigned to GCs using an approach inspired by earlier studies that used multilocus data for probabilistic assignment of individuals to populations (33). First, a reference table was constructed by calculating the frequency of each allele for each of the seven SNPs for each GC, using data from the 578 STs (see Table S1 in the supplemental material). Next, a likelihood score for assigning each SNP type to each GC was calculated as $\prod_i p_i^2$, where $p_i$ is the frequency of the allele of SNP $i$ in a given GC. Zero-frequency alleles were recorded as $1/(n + 1)$, where $n$ is the number of STs in the GC; this treatment assumes that zero-frequency alleles are rare and would be found with additional sampling. Finally, a given SNP type was assigned to the GC with the highest likelihood score if the log of the ratio of the highest likelihood score to the next highest was >1.3, indicating >95% confidence in the assignment.
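The assignment rule can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy two-SNP reference data, and the use of base-10 logarithms are assumptions (a base-10 threshold of 1.3 corresponds to a likelihood ratio of about 20:1, which is consistent with the stated >95% confidence).

```python
import math
from collections import Counter


def build_reference(snp_types_by_gc):
    """Build the reference table of allele frequencies.

    snp_types_by_gc maps a GC name to a list of SNP-type strings,
    one per ST (for example "CTAATAA" for seven SNPs).
    Returns GC -> (number of STs, list of per-SNP frequency dicts).
    """
    reference = {}
    for gc, snp_types in snp_types_by_gc.items():
        n = len(snp_types)
        per_snp = []
        for i in range(len(snp_types[0])):
            counts = Counter(t[i] for t in snp_types)
            per_snp.append({allele: c / n for allele, c in counts.items()})
        reference[gc] = (n, per_snp)
    return reference


def likelihood(snp_type, n, per_snp):
    """Product over SNPs of the squared allele frequency in one GC."""
    score = 1.0
    for allele, freqs in zip(snp_type, per_snp):
        # Zero-frequency alleles are recorded as 1 / (n + 1)
        p = freqs.get(allele) or 1.0 / (n + 1)
        score *= p ** 2
    return score


def assign(snp_type, reference, threshold=1.3):
    """Return the best-scoring GC, or None if confidence is not met.

    Assumption: the log ratio is taken in base 10.
    """
    scores = sorted(
        ((likelihood(snp_type, n, per_snp), gc)
         for gc, (n, per_snp) in reference.items()),
        reverse=True,
    )
    (best, best_gc), (second, _) = scores[0], scores[1]
    if math.log10(best / second) > threshold:
        return best_gc
    return None


# Hypothetical reference data: two clusters typed at two SNPs
toy_reference = build_reference({
    "GC_A": ["CT"] * 9 + ["CA"],
    "GC_B": ["TA"] * 10,
})
print(assign("CT", toy_reference))  # confidently assigned
print(assign("CA", toy_reference))  # ratio too small, left unassigned
```

In the toy data, the SNP type "CA" is rare in one cluster and requires a zero-frequency substitution in the other, so the two likelihood scores are close and the type is left unassigned, which mirrors the "unassigned" outcome described in Results.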
SNP assay. PCR amplification of the MLST loci used the standard primers and thermocycler conditions described previously (34), with the exception that an annealing temperature of 50°C was used for some amplifications of gtr and pyrR loci. PCR products were combined to reach a total volume of 10 μl for each of two subsequent, allele-specific primer extension (ASPE) reaction mixtures containing PCR products from arcC, aroE, tpiA, and yqiL (reaction 1) and from gtr, mutS, and pyrR (reaction 2). The two reaction mixtures were purified of residual deoxynucleoside triphosphates (dNTPs) by addition of 1 μl of 5 U of exonuclease I (EXO) and 0.5 U of shrimp alkaline phosphatase (SAP) (Invitrogen) and incubation at 37°C for 30 min and 80°C for 15 min.
Fourteen ASPE primers were designed to detect the alleles of the seven selected SNPs (described in Results). Each of the two ASPE reaction mixtures contained 5 μl of the EXO-SAP-treated PCR products, 0.3 U of Tsp DNA polymerase (Invitrogen), 25 nM ASPE primer mixture, 5 μM dATP, dTTP, dGTP, and biotin-dCTP (Invitrogen), 20 mM Tris-HCl, 50 mM KCl, and 1.25 mM MgCl2. The ASPE thermocycler conditions were 1 cycle of 95°C for 5 min and then 30 cycles of 94°C for 30 s, 55°C for 30 s, and 72°C for 1 min, with a final extension of 72°C for 3 min. The manufacturer's protocol (Luminex) was followed for hybridization of ASPE products to xTAG microspheres and washing, except that the concentrations of microspheres were increased to 125 per μl, followed by incubation in 50 μl 1× Tm hybridization buffer with 0.2% streptavidin R–phycoerythrin conjugate at 37°C for 15 min.
Samples were analyzed on a Luminex 200 system (Millipore) using Luminex Xponent v3.1 software. Results were expressed as median fluorescence intensity (MFI) for each allele. The MFI values were corrected for background by subtracting the value of the MFI of unreacted bead controls from the test MFI value. An allele was scored with a minimum threshold of >150 MFI and a proportion of $\mathrm{MFI}_{\text{called allele}} / (\mathrm{MFI}_{\text{wild-type allele}} + \mathrm{MFI}_{\text{mutant-type allele}})$ of >0.9.
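The scoring rule for a single SNP can be sketched as follows. This is a minimal sketch, not the software used in the study: the function name and argument layout are hypothetical, each allele's bead is assumed to have its own unreacted-bead background, and background-corrected values below zero are assumed to be set to zero.

```python
def call_allele(mfi_wild, mfi_mutant, bg_wild, bg_mutant,
                min_mfi=150.0, min_proportion=0.9):
    """Score one SNP from the MFI values of its two allele-specific beads.

    Returns "wild", "mutant", or None when no allele meets both criteria.
    """
    # Background correction: subtract the unreacted bead control.
    # Assumption: negative corrected values are clamped to zero.
    wild = max(mfi_wild - bg_wild, 0.0)
    mutant = max(mfi_mutant - bg_mutant, 0.0)
    total = wild + mutant
    if total == 0:
        return None

    # The candidate call is the allele with the stronger signal
    name, signal = ("wild", wild) if wild >= mutant else ("mutant", mutant)

    # Both criteria must hold: signal > 150 MFI and proportion > 0.9
    if signal > min_mfi and signal / total > min_proportion:
        return name
    return None


# Hypothetical readings for three SNPs
print(call_allele(2000, 60, 50, 50))    # strong, unambiguous signal
print(call_allele(1000, 500, 50, 50))   # mixed signal, proportion too low
print(call_allele(150, 40, 50, 50))     # signal below the MFI threshold
```

Requiring both a minimum signal and a minimum proportion guards against two different failure modes: weak reactions in which either allele could be noise, and strong reactions in which both primers extended.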
Detection of various genetic markers. Isolates were screened by PCR for the presence of seven genetic markers previously studied for their associations with GCs (26). These included the putative hospital markers icaA, IS256, mecA, sesD (bhp), and sesF (aap) and the putative commensal markers fdh and arginine catabolic mobile element (ACME). PCR primer sequences for these markers were listed previously (26), and thermocycler conditions were the same as those used for MLST (34).
Statistical analyses. Bivariate associations were measured with odds ratios and 95% confidence intervals (CIs), using InStat v3.1 software (GraphPad). In cases where 2-by-2 contingency tables had zero-frequency cells, 0.5 was automatically added to each cell. The diversity of SNP types within GCs was measured by Simpson's index (35) using the Comparing Partitions website (http://www.comparingpartitions.info/), with 95% CIs calculated as described previously (36).
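Both statistics are simple to compute by hand, and the sketch below shows one way to do so. It is a sketch under stated assumptions, not a reproduction of the software output: the confidence interval uses the logit (Woolf) approximation, which may differ from the method implemented in InStat, and the diversity index uses the unbiased form 1 − Σ n_j(n_j − 1)/[N(N − 1)]. Function names and the example counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for the 2-by-2 table [[a, b], [c, d]].

    If any cell is zero, 0.5 is added to every cell, as described in the text.
    Assumption: the CI uses the logit (Woolf) approximation.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = [x + 0.5 for x in (a, b, c, d)]
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def simpson_diversity(type_counts):
    """Simpson's index of diversity from the number of isolates per SNP type.

    Assumption: the unbiased form 1 - sum(n_j (n_j - 1)) / (N (N - 1)).
    """
    n = sum(type_counts)
    if n < 2:
        raise ValueError("at least two isolates are required")
    return 1.0 - sum(c * (c - 1) for c in type_counts) / (n * (n - 1))


# Hypothetical table: marker present/absent (rows) by hospital/nonhospital (columns)
print(odds_ratio_ci(10, 5, 5, 10))
# Hypothetical table with a zero cell, triggering the 0.5 correction
print(odds_ratio_ci(5, 0, 3, 4))
# Hypothetical cluster with four isolates split evenly over two SNP types
print(simpson_diversity([2, 2]))
```

An association is conventionally read as significant at the 5% level when the interval excludes 1.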
Machine learning algorithm for prediction of isolation sources. Support vector machines (SVMs) represent a type of supervised machine learning algorithm that can perform classification (37). In essence, SVMs first transform the predictor data (in this study, binary-coded GCs and genetic markers) into a higher-dimensional space by use of a kernel function and then find a hyperplane that maximally separates the classes. Two-class prediction was done to distinguish hospital from nonhospital sources and, separately, infection from contaminant sources. SVMs were run with the e1071 v1.6-4 package of R v2.7.0 software (38). SVMs used a radial kernel and two parameters, C (cost of errors) and γ (kernel specific). Optimal values of C and γ were determined from a grid of values, using 10-fold cross-validation with a random 70% of the sample. The SVMs were trained with the same random 70% of the sample as used for cross-validation and were tested with the remaining 30% of the sample. This entire procedure was repeated 10 times, where each replicate represented a random 70:30 partition of the sample. Classification accuracy, sensitivity, and specificity were averaged across the 10 replicates. SVMs were rerun using “clone-corrected” samples, which excluded duplicate isolates of the same SNP type and source from the same patient. This clone-corrected sample totaled 119 isolates: 39 isolates from hospital infections, 47 contaminants of clinical specimens, and 33 nonhospital carriage isolates.
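The partitioning and averaging scheme can be sketched independently of the classifier. The Python sketch below is illustrative only: the study used SVMs from the R package e1071, whereas here the model is a pluggable `fit` function and the example uses a hypothetical placeholder that always predicts the most common training label. Parameter tuning by 10-fold cross-validation would take place inside `fit` and is not shown. All names are assumptions, and undefined sensitivity or specificity values are reported as 0.0 for simplicity.

```python
import random
from collections import Counter


def metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity, and specificity for a two-class prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Assumption: a metric with an empty denominator is reported as 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


def repeated_holdout(X, y, fit, positive, replicates=10, train_frac=0.7, seed=0):
    """Average the metrics over random train/test partitions of the sample.

    fit(X_train, y_train) must return a function mapping one sample to a label.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        cut = round(train_frac * len(y))
        train, test = idx[:cut], idx[cut:]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = [predict(X[i]) for i in test]
        results.append(metrics([y[i] for i in test], y_pred, positive))
    return tuple(sum(r[k] for r in results) / replicates for k in range(3))


def fit_majority(X_train, y_train):
    """Hypothetical placeholder model: always predicts the commonest label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda sample: majority


# Source counts from the text: 114 hospital and 40 nonhospital isolates.
# The predictor values are dummies because the placeholder ignores them.
X = [[0]] * 154
y = ["hospital"] * 114 + ["nonhospital"] * 40
print(repeated_holdout(X, y, fit_majority, positive="hospital"))
```

The placeholder yields perfect sensitivity and zero specificity, which is why accuracy is best read alongside both of the other measures when the classes are unequal in size.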
RESULTS
Verification of the population genetic structure of S. epidermidis. Bayesian clustering of 578 STs in the international MLST database identified six GCs (Fig. 1). A total of 419 (96%) of 437 STs previously analyzed by Thomas et al. (26) were classified into the same GCs with the updated database (see Table S2 in the supplemental material). All of the 18 STs that were reclassified involved GC3 (16 changed to GC3, 2 changed from GC3). Both GC3 and GC5 were significantly enriched for admixed STs and had the highest proportions of admixed nucleotides (Table 1). Both GC1 and GC6 were significantly underrepresented for admixed STs and had the lowest proportions of admixed nucleotides. Thus, the population structure of S. epidermidis, as inferred from Bayesian clustering of the MLST database, was relatively consistent when the sample of 437 STs was increased to 578 STs.
Fig. 1. Assignment of 578 sequence types (STs) in the multilocus sequence typing (MLST) database to six genetic clusters (GCs). The x axis corresponds to all 578 STs in the MLST database, color coded by GC as follows: red, GC1; green, GC2; blue, GC3; orange, GC4; pink, GC5; teal, GC6. The y axis indicates the percentage of ancestry contributed to the ST by each GC.
Table 1. Summary of BAPS admixture analysis of 578 S. epidermidis sequence types
Development, validation, and application of a SNP typing assay to assign isolates to GCs. One SNP from each of the seven MLST loci was selected to maximally differentiate GCs, as guided by the GST statistic (Table 2). These seven SNPs produced 54 SNP types among the 578 STs (see Table S2 in the supplemental material). The accuracy of assigning these SNP types to the same GCs as found with full MLST data was determined in silico using the approach described in Materials and Methods. The SNP types for 545 (94%) of 578 STs were correctly assigned to GCs with confidence. Of the remaining 33 STs, the SNP types for 6 STs were incorrectly assigned to GCs with confidence, and the SNP types for 27 STs were unassigned because the threshold for confidence was not met (see Table S2 in the supplemental material). SNP type 3 (CTAATAA) was represented by 143 STs, including 3 (ST145, ST161, and ST164) of the 6 STs that would be incorrectly assigned to GCs with confidence. However, the presence of the arcC8 allele can be used to identify SNP type 3 isolates that are classified among these problematic STs.
Table 2. Single nucleotide polymorphisms used to assign S. epidermidis isolates to genetic clusters
Allele-specific primer extension primers were designed to detect the alleles of the seven SNPs (Table 3) with Luminex technology. This SNP assay was technically validated using 30 strains of known, diverse STs. Each of these strains' alleles matched the expected result, with a median fluorescence intensity of >150 and an allele proportion of >0.90 (see Table S3 in the supplemental material). Application of the SNP typing assay to our study sample of 154 isolates resulted in confident assignment of each of 14 SNP types to a GC (Table 4). SNP type 3 was the most frequent SNP type, with 62 isolates; sequencing of the arcC gene fragment from these isolates showed that none had the arcC8 allele and thus that none belonged to the problematic STs. Although GC2, GC3, and GC4 were relatively rare in this sample, they tended to be more diverse in SNP type than GC1, GC5, and GC6, but this result was not statistically significant (Table 4).
Table 3. Allele-specific primer extension (ASPE) primers
Table 4. Diversity of the six S. epidermidis genetic clusters in the Illinois population
Associations between GCs, genetic markers, and isolation sources. PCR was used to detect seven genetic markers that had been studied previously for their associations with GCs (26). GC5 was positively associated with icaA, IS256, and mecA (Table 5). GC6 was positively associated with ACME and sesD (bhp). The fdh gene was detected exclusively within GC2 and GC4 (Table 5).
Table 5. Associations of genetic clusters with selected genetic markers
While there is a large literature on the associations between some genetic markers and isolation sources, the associations between GCs and isolation sources have not been measured previously. Results in Table 6 contrast hospital with nonhospital sources and further subdivide hospital sources to contrast infection with contaminant sources. GC5, GC6, icaA, IS256, sesD (bhp), and mecA were associated with hospital sources (Table 6). GC1 and ACME were associated with nonhospital sources. There was no evidence of an association between GC2, GC4, and fdh and nonhospital sources (Table 6). In contrast, GC6 and mecA were associated with an infection source, and no characteristic was associated with contaminant sources.
Table 6. Associations of genetic clusters and selected genetic markers with isolation sources
Prediction of isolation sources with GCs and genetic markers. Support vector machines (SVMs) were used to predict isolation sources with all six GCs and the five genetic markers that were associated with isolation sources in bivariate analyses. Performance measures were averaged over 10 replicates of cross-validating parameters, training, and testing of SVMs with random 70:30 partitions of the sample as described in Materials and Methods. GCs predicted hospital and nonhospital sources with an accuracy of 80%, and the prediction of a hospital source when the isolate was from the hospital was much better (90% sensitivity) than the prediction of a nonhospital source when the isolate was from nonhospital carriage (49% specificity) (Table 7). Genetic markers predicted hospital and nonhospital sources with an accuracy of 78%, which was indistinguishable from the accuracy achieved with GCs, considering the broad confidence intervals. As with the accuracy achieved with GCs, the accuracy achieved with the markers was mostly due to the ability to distinguish the hospital sources (92% sensitivity, 50% specificity). In contrast, neither GCs nor markers performed well in analyses predicting infection and contaminant sources; the accuracy was <53% for both predictors (Table 7). Clone-corrected samples had similar levels of accuracy with broader confidence intervals than the samples that included all isolates, but they had larger differences in sensitivity and specificity in analyses predicting infection versus contaminant sources (Table 7). As noted previously, only two characteristics were associated with infection source (GC6 and mecA; Table 6) and no characteristic was associated with contaminant source. The SVMs performed poorly under these conditions and appear to have sometimes overfitted the training data (i.e., the SVMs picked the predominant class from the training set).
Table 7. Performance of genetic clusters and selected genetic markers in predicting isolation source with SVMs
Post hoc analysis of isolation sources. Although isolation sources were not defined using genetic information, it might be instructive to reevaluate sources in light of this added information. In particular, we expect multiple infection isolates from the same patient to often be indistinguishable genetically, allowing for some intrahost evolution of the bacteria. For 20 (83%) of 24 patients with multiple infection isolates, all infection isolates from a given patient matched by GC, and for 13 (54%) of 24 patients, all infection isolates from a given patient were found to match by the five genetic markers. Note, however, that the markers include several mobile genetic elements and are not intended for strain identification. On the other hand, among the 17 patients who were deemed to have both infection and contaminant isolates, we expect the isolates from these different sources to often differ genetically. All contaminant isolates were different from all infection isolates from a given patient in only 4 (24%) of 17 patients in analyses considering GCs and 7 (41%) of 17 patients in analyses considering markers.
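The within-patient concordance check amounts to grouping isolates by patient and asking whether every isolate in a group shares the same characteristic. A minimal sketch follows; the record layout, field names, and example isolates are hypothetical and are not taken from Data Set S1.

```python
from collections import defaultdict


def concordant_patients(isolates, key):
    """Count patients whose multiple isolates all share the same value of key.

    isolates is a list of dicts, each with a "patient" field and the field
    named by key (for example "gc"). Patients with a single isolate are
    skipped. Returns (number of concordant patients, number of patients
    with multiple isolates).
    """
    by_patient = defaultdict(list)
    for isolate in isolates:
        by_patient[isolate["patient"]].append(isolate[key])
    multiple = [values for values in by_patient.values() if len(values) > 1]
    concordant = sum(len(set(values)) == 1 for values in multiple)
    return concordant, len(multiple)


# Hypothetical infection isolates from four patients
toy_isolates = [
    {"patient": "P1", "gc": "GC5"},
    {"patient": "P1", "gc": "GC5"},
    {"patient": "P2", "gc": "GC6"},
    {"patient": "P2", "gc": "GC1"},
    {"patient": "P3", "gc": "GC6"},
    {"patient": "P3", "gc": "GC6"},
    {"patient": "P3", "gc": "GC6"},
    {"patient": "P4", "gc": "GC1"},
]
matched, total = concordant_patients(toy_isolates, "gc")
print(f"{matched} of {total} patients matched by GC")
```

To compare marker profiles instead of GCs, the same function can be given a key whose value is a tuple of the marker results, so that two isolates match only if all markers agree.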
These results suggest that our sampling procedures adequately captured true infection isolates, but they also suggest that distinguishing contaminants from infection isolates from the same patient on the basis of colony morphology, as is common practice in some hospital laboratories, may not be ideal. To determine the impact of some potentially misclassified contaminant isolates on our analysis, we reran the SVMs after removing all 21 contaminant isolates from infected patients, leaving the 34 unambiguous contaminant isolates from patients with single blood cultures and evidence against infection upon chart review. While the results of analysis of the ability to distinguish hospital from nonhospital sources were very similar to those of the previous analysis (77% and 78% accuracy by GCs and markers, respectively), there was a 12% to 16% increase in accuracy in distinguishing infection from contaminant sources in comparison to the previous analysis (61% and 64% accuracy by GCs and markers, respectively).
DISCUSSION
In pioneering work on the population genetic structure of S. epidermidis, MLST data were analyzed using the eBURST algorithm and most STs were classified into one clonal complex (22). Subsequent studies reported some instabilities in this classification scheme as the MLST database grew from 74 STs to 211 STs (25). With other species of recombining bacteria, Bayesian clustering tools that model genetic admixture have helped to define population structure (39, 40). Recently, we used a Bayesian clustering approach with S. epidermidis MLST data, including all 437 STs in the international database, and identified six genetic clusters (GCs) (26). Here, we confirmed the presence of these six GCs in an updated database of 578 STs. A total of 96% of previously studied STs were classified into the same GCs with the enlarged database, and all differently classified STs involved the recombinant GC3.
In a clinical setting, collecting and analyzing MLST data may not be practical, but it is not a stretch to consider implementing SNP typing and analysis using various multiplex platforms already operational in many laboratories (41). Diverse sets of SNPs have been used in several studies for typing staphylococci (42–44). Here, we used the GST statistic to select those SNPs from MLST data that best distinguish the six GCs. The seven selected SNPs correctly and confidently assigned 94% of the 578 STs to their GC, which indicates that small sets of SNPs can provide a reliable foundation for a rapid assay of S. epidermidis genetic background.
Previous work indicated that S. epidermidis GCs may reflect the species' ecology to some extent (26). Specifically, associations were found between GCs and genetic markers of isolation sources in clinical specimens from New York, but that study did not attempt to distinguish infection from contaminant isolates and it did not include nonhospital carriage isolates (26). Here, study of isolates from both clinical and nonclinical samples from Illinois replicated several of the previously observed GC-marker associations and allowed associations between GCs and isolation sources to be measured for the first time. GC5 was confirmed to be associated with icaA, IS256, and mecA, and all isolates but one were from a hospital source, supporting the notion that this cluster is a hospital specialist. On the other hand, GC1 and GC6 did not have consistent associations with genetic markers across studies, and they differed from each other in their associations with isolation sources. Studies of isolates from other geographic areas are needed to assess whether GC1 and GC6 exhibit wide variation in their marker profiles and isolation sources, as might be expected of generalists.
Hospital-associated populations have been identified in other bacterial species that are opportunistic pathogens. Willems et al. (40) identified three hospital-associated populations of Enterococcus faecium using Bayesian clustering of MLST data, which subdivided the CC17 group previously defined by eBURST analysis of MLST data. Each of the three populations was significantly underrepresented for admixed STs (40); however, subsequent analysis of genome sequences from representatives of these populations identified an important role for recombination in generating their diversity (45). By comparison, the MLST data for S. epidermidis suggest relatively more recombination in hospital-associated GC5 and less recombination in hospital-associated GC6, whereas a subsequent genomic analysis that placed GC5 and GC6 together in a group with GC1 showed recombination in all three of these backgrounds (27). These results indicate that hospital-associated populations of S. epidermidis may not be isolated from recombination with nonhospital populations as has been proposed for E. faecium.
GC3 was confirmed to be a highly recombinant genetic cluster of S. epidermidis. The previous analysis of the MLST database of 437 STs (26) and the current analysis of the larger database of 578 STs both showed that GC3 has a higher proportion of admixed STs and a higher proportion of admixed nucleotides than other GCs. These results are consistent with the genomic analysis reported by Méric et al. (27), which showed GC3 isolates to be the most recombinant. The genetic and/or ecological basis for recombinant character of GC3 and its role in the diversification of S. epidermidis populations require further study.
GC2 and GC4 were the sole backgrounds for the fdh gene, and all isolates belonging to these two GCs were positive for fdh. This gene was proposed by Conlan et al. (14) as a marker for commensal isolates. Here, the GC2 and GC4 isolates were relatively rare overall, and nonhospital carriage isolates were not overrepresented among them. Our data suggest that fdh is a marker of these particular GCs rather than a marker of a commensal lifestyle. Despite their rarity in the sample, GC2, GC3, and GC4 tended to be more diverse in SNP types than GC1, GC5, and GC6. Of note, SNP types extracted from draft genome sequences of S. epidermidis from wild mouse species (46) as well as from an unusual enterotoxin-producing human clinical isolate (47) can be reliably classified into GC4 (I. E. Tolo and D. A. Robinson, unpublished data). Together, these observations may indicate that some of these rare GCs represent a large, sparsely sampled population with an ecological niche that is broader than that of the skin of healthy humans.
The goal of this study was to test the hypothesis that GCs could distinguish isolation sources. Using a supervised machine learning algorithm, we observed no significant difference in the accuracy of predicting isolation sources between GCs and a set of five genetic markers that might more directly relate to pathogenicity. While both GCs and markers predicted hospital and nonhospital sources with about 80% accuracy, they predicted infection and contaminant sources within the hospital only about half the time. These results indicate that hospital and nonhospital sources are better distinguished than are different populations within hospitals. Infection isolates might be selected at random from a population that has evolved fitness for hospital settings.
Our study had some limitations. One potential source of error, evaluated in the post hoc analysis of sources, comes from the selection of contaminant isolates from infected patients using colony morphology as the discriminator. Even though this reflects a “real world” approach to identifying contaminants in some hospital laboratories, these potential misclassifications of source make the infection and contaminant sources appear to be more similar to each other. Here, isolate selection attempted to minimize false positives with respect to infection, and very few of the multiple infection isolates may have been inadvertent contaminants. Thus, while blood culturing and sepsis diagnosis remain complex processes, involving blood sampling techniques, laboratory procedures, and clinical assessments (8, 48), SNP-based characterization of two or more isolates from the same patient may aid in diagnosing “true” infection in some individual patients.
The use of relatively small sample sizes of the different sources was another limitation of our study that resulted in broad confidence intervals for accuracy and some overfitting of the training data in analyzing subsets of the sample. Sharma et al. (23) used SVMs directly with S. epidermidis MLST data and reported a slightly lower prediction accuracy (73%) that was partially attributed to the small sample size of 100 isolates and the high diversity of STs. Here, using a sample size of 154 isolates, clustering of isolates into GCs, and two-class prediction with cross-validated SVM parameter values, it was possible to achieve slightly higher, but still generalizable, prediction accuracy. However, we anticipate that the greatest gains in predicting the sources of S. epidermidis isolates solely from bacterial characteristics will come from studying well-sampled genome sequences for informative polymorphisms, which might be exploited for diagnostic assays using an approach similar to that outlined in this report.
FOOTNOTES
- Received 22 December 2015.
- Returned for modification 15 January 2016.
- Accepted 7 April 2016.
- Accepted manuscript posted online 13 April 2016.
Supplemental material for this article may be found at http://dx.doi.org/10.1128/JCM.03345-15.
- Copyright © 2016, American Society for Microbiology. All Rights Reserved.