Whole-Genome Sequencing Reveals the Contribution of Long-Term Carriers in Staphylococcus aureus Outbreak Investigation

ABSTRACT Whole-genome sequencing (WGS) makes it possible to determine the relatedness of bacterial isolates at a high resolution, thereby helping to characterize outbreaks. However, for Staphylococcus aureus, the accumulation of within-host diversity during carriage might limit the interpretation of sequencing data. In this study, we hypothesized the converse, namely, that within-host diversity can in fact be exploited to reveal the involvement of long-term carriers (LTCs) in outbreaks. We analyzed WGS data from 20 historical outbreaks and applied phylogenetic methods to assess genetic relatedness and to estimate the time to most recent common ancestor (TMRCA). The findings were compared with the routine investigation results and epidemiological evidence. Outbreaks with epidemiological evidence for an LTC source had a mean estimated TMRCA (adjusted for outbreak duration) of 243 days (95% highest posterior density interval [HPD], 143 to 343 days) compared with 55 days (95% HPD, 28 to 81 days) for outbreaks lacking epidemiological evidence for an LTC (P = 0.004). A threshold of 156 days predicted LTC involvement with a sensitivity of 0.875 and a specificity of 1. We also found 6/20 outbreaks included isolates with differing antimicrobial susceptibility profiles; however, these had only modestly increased pairwise diversity (mean 17.5 single nucleotide variants [SNVs] [95% confidence interval {CI}, 17.3 to 17.8]) compared with isolates with identical antibiograms (12.7 SNVs [95% CI, 12.5 to 12.8]) (P < 0.0001). Additionally, for 2 outbreaks, WGS identified 1 or more isolates that were genetically distinct despite having the outbreak pulsed-field gel electrophoresis (PFGE) pulsotype. The duration-adjusted TMRCA allowed the involvement of LTCs in outbreaks to be identified and could be used to decide whether screening for long-term carriage (e.g., in health care workers) is warranted. Requiring identical antibiograms to trigger investigation could miss important contributors to outbreaks.

T o manage Staphylococcus aureus outbreaks effectively, infection control practitioners need to determine the relatedness of isolates from suspected cases. Wholegenome sequencing (WGS) has shown superior resolution compared with standard typing techniques (spa, pulsed-field gel electrophoresis [PFGE]) when used for individual outbreaks (1)(2)(3)(4) and can also provide additional information about resistance, pathogenicity, and population structure (5)(6)(7)(8). However, it has been argued that the accumulation of within-host diversity during S. aureus carriage could result in erroneous inferences about transmission. This has been cited as a potential weakness in applying sequencing to S. aureus outbreaks and may lead to the misinterpretation of genuine transmission routes (1,9,10).
However, rather than within-host diversity being a limitation on sequencing-based outbreak investigation, it could in fact be exploited to determine whether a long-term carrier is implicated in maintaining an outbreak. This information could be used by infection control practitioners when considering whether or not to deploy extended screening (e.g., of health care workers).
In this study, we tested the hypothesis that WGS can be used to predict the presence of a long-term carrier as an outbreak source. First, we examined individuals with newly acquired S. aureus nasal carriage to ascertain whether diversity is present at acquisition or develops over time. Next, we analyzed 20 S. aureus outbreaks, which were previously investigated using standard typing techniques, to assess the added utility of WGS. Finally, we compared WGS with epidemiological data to determine whether the presence of a long-term carrier maintaining the outbreak could be inferred from the WGS data.

RESULTS
Comparison of within-host diversity in newly acquired and long-term carriage. Eight subjects were identified with Ն3 consecutive bimonthly negative nasal swabs, followed by Ն1 year of swabs consistently positive for S. aureus. All isolates were methicillin-susceptible S. aureus (MSSA), representing 7 spa types, 5 sequence types, and 4 clonal complexes. The median time from the first to last positive sample was 490 days (range, 358 to 727 days). In total, 135 isolates were successfully sequenced from 16 samples. One isolate (case 1219, early sample) failed quality checks and was excluded.
In 6/8 subjects, there was a significant increase in mean pairwise diversity (MPWD) between the first and last samples (P Ͻ 0.05) (Fig. 1). In one participant (case 1236), the increase was not significant (P ϭ 0.52), and for another (case 1375), there was a decrease which was marginally significant (P ϭ 0.07). Overall, MPWD increased from 0.88 single nucleotide variants (SNVs) (95% confidence interval [CI], 0.65 to 1.11) to 3.30 (95% CI, 2.92 to 3.68) between the first and last samples (P Ͻ 0.001). An analysis of the phylogenetic trees (see supplemental material) showed highly clonal early populations, and in 2 participants, only a single strain was observed. One individual (case 1219) had a more diverse early sample (MPWD, 4.57; 95% CI, 3.10 to 6.04) compared with those of the other participants. This subject's first positive swab was at month 12, and they had completed a course of amoxicillin-clavulanic acid 1 day before their final negative swab. Therefore, it is possible that this was a false negative due to antibiotic suppression, meaning that there may have been up to 4 months of carriage prior to the first positive swab, accounting for the increased diversity.
Two participants (cases 1218 and 1219) shared the same address and had isolates of the same spa type. Participant 1219 (donor) became positive 2 months before participant 1218 (recipient). On a direct comparison of both early populations, we found that the recipient had an entirely clonal initial population, identical to 4/8 of the donor's strains (see supplemental material).
For an additional 13 participants positive at the study entry, within-host diversity as measured by MPWD ranged from 0 SNVs (3 individuals) to 26 SNVs. This may be due to differences in the acquisition time to the time of the first sample, which is unknown for these individuals.
Overall, isolates from 391 cases were sequenced. Nine (2.3%) were from health care workers (HCWs), the remainder being from patients or household members. Outbreak samples represented 9 clonal complexes, 11 sequence types, and 12 spa types.
Phylogenetic analysis of outbreaks. Phylogenetic trees for each outbreak are provided in the supplemental material. Two outbreaks had isolates which were equally or more distant than comparator isolates despite having the outbreak pulsotype: outbreak D (one isolate, 53 SNVs from the index case compared with 21) and outbreak S (two isolates, 49 and 46 SNVs from the index case compared with 46). These were therefore considered to be sporadic nonoutbreak isolates and were excluded from further analysis.
The overall MPWD across all outbreak sample pairs for the remaining 388 isolates was 13.8 SNVs (95% CI, 13.6 to 13.9) compared with 4,444 SNVs for nonoutbreak spa-matched pairs (95% CI, 2,492 to 6,395) and 30,192 SNVs for nonoutbreak isolates from the same units (95% CI, 29,781 to 30,603). All outbreak isolates were Յ30 SNVs from the index case; 381/388 (98%) were Յ10 SNVs from their nearest neighbor. The 7 more distant isolates came from outbreaks lasting more than 6 months (B, G, and S). All isolates were mapped to a standard reference genome; mapping to an alternative reference strain (performed for 6 outbreaks) yielded only 2 additional SNVs overall (see supplemental material), with no effect on topology.   Time to most recent common ancestor and long-term carriers. Twelve outbreaks (60%) had epidemiological evidence of a long-term carrier (LTC). Three included cases with recurrent staphylococcal disease, in 5 an LTC was suspected due to nonoverlapping ward stays, and in 4, at least one case had postoutbreak long-term carriage (Fig.  2). The pairwise distances between isolates from outbreaks with evidence for an LTC ranged from 0 to 46 SNVs compared with 0 to 10 SNVs for outbreaks with no evidence for an LTC ( Table 2). The mean duration-adjusted time to most recent common ancestor (TMRCA) for outbreaks with a suspected or proven LTC was 243 days (95% highest posterior density interval [HPD], 143 to 343 days) compared with 55 days (95% HPD, 28 to 81 days) for outbreaks with no evidence for an LTC (P ϭ 0.004) (Fig. 2). Excluding postoutbreak carriage, an analysis of the receiver operating characteristic curve gave an area under the curve (AUC) of 0.953 (95% CI, 0.851 to 1). Using the Youden index to  In 6/20 outbreaks, antimicrobial susceptibility differed across isolates, confirmed by the presence/absence of mobile resistance determinants identified using BLAST (11); however, these clearly belonged to the outbreak on phylogenetic analysis. MPWD between isolates sharing an antibiogram was 12.7 SNVs (95% CI, 12.5 to 12.8) compared with 17.5 (95% CI, 17.3 to 17.8) for isolates with differing antibiograms (P Ͻ 0.0001), although a substantial number of isolate pairs with different antibiograms had 0 SNVs between their core genomes (Fig. 3).
For other factors potentially related to outbreak diversity, there was no evidence of an association between MPWD and outbreak duration, reason for investigation, epidemiological setting, or MRSA phenotype (P Ͼ 0.05).

DISCUSSION
We have tested the use of WGS for S. aureus outbreak investigation using 20 outbreaks. By comparing observed outbreak SNV distances with nonoutbreak and spa/multilocus sequence type (MLST)-specific diversities, we were able to distinguish outbreak from nonoutbreak strains.
Our observation of minimal diversity in recent acquisitions of nasal carriage is reassuring for the application of WGS data to outbreaks. For the donor-recipient pair, we observed a narrow transmission bottleneck, with a clonal founding population despite a diverse donor population. Although this is a single case, the findings are supported by the minimal diversity seen in the early samples for the majority of carriage study subjects, and further evidence for a narrow transmission bottleneck is provided by the relatively short SNV distances observed across the outbreaks. Taken together, these findings suggest that, in an acute short-term outbreak, there will be insufficient time for diversity to accumulate.
If WGS is to be used routinely for outbreak investigation, these findings provide evidence that single-colony sequencing is likely to identify clusters reliably in this context, allowing for the ease of interpretation and ensuring that WGS remains an affordable alternative to standard typing, as a requirement for sequencing multiple colonies per case, as implied by previous investigators (1, 10), would rapidly escalate costs and render WGS too expensive for routine use.
Previous carriage studies have found greater distances than seen here (9, 11); however, these did not account for the estimated time of acquisition. We postulate that the existence of a significant cloud of diversity (4, 12) may be a marker of long-term carriage. Therefore, in outbreaks, higher diversity may indicate the involvement of an LTC, with outbreak diversity reflecting the donor cloud.
In support of this, we observed a significant difference in duration-adjusted TMRCAs between outbreaks with and without evidence of an LTC. The longest TMRCAs were in hospital outbreaks with indirect links between cases (i.e., nonoverlapping ward stays). The likelihood of "missed" cases in these outbreaks was considered low due to the enhanced screening, and the most likely reason for the reoccurrence of the outbreak strain was thought by the investigating teams to be either a reintroduction from the community (outbreak G) or from a staff member with long-term carriage (outbreaks A, I, N, and S). Staff carriage was proven in one outbreak (by sampling and subsequent termination of the outbreak on their exclusion), but in the remaining outbreaks, HCWs were either not sampled or HCW sampling was anonymized and positive results could not be linked definitively with the suspected carrier.
The outbreaks included in this study necessarily reflect the circulating S. aureus clones in the United Kingdom and the concerns of local infection control teams. The sampling frame is therefore enriched for MRSA and PVL-positive outbreaks and those from neonatal units. Despite this, there is a wide representation of sequence types.
In spite of the enhanced surveillance during each outbreak, there inevitably are missing transmission links, due to missed sampling, suppression from antimicrobial therapy, or delays in identifying contacts. One reason for missed samples may be the use of antibiograms as an initial screening tool for identifying putative outbreak isolates, as most investigating teams only collected isolates with identical or highly similar antimicrobial susceptibility profiles. However, in the six outbreaks where isolates were included with differing antibiograms, the core genomes were remarkably conserved. This is presumably due to the ready loss/gain of mobile genetic elements (13) and shows that reliance on antibiograms may lead to samples being wrongly excluded.
The variability of mobile elements is also important for interpreting genetic distances. Recombination events such as the gain/loss of a mobile element will introduce a large number of SNVs even though this represents a single genetic event. Current analysis tools which can accommodate this are computationally complex and, for large data sets, require sizable computing resources. A simpler approach is to exclude the "mobile-ome" from phylogenetic analyses and compare only the core genome, and the results above demonstrate that this is an acceptable strategy. Similarly, mapping to alternative reference strains (performed for six outbreaks) had minimal effects on SNV analysis and phylogeny, removing the need for identification of clonal complex or assembly of index case prior to phylogenetic analysis. This streamlined approach brings WGS closer to routine use, as a readily deployable method with a minimal burden of computational time and bioinformatics expertise.
In conclusion, we have demonstrated how a WGS-based approach can be applied to S. aureus outbreak investigations. We have shown that current sampling strategies provide sufficient information to determine whether isolates belong to an outbreak, and that, rather than confounding the investigation, within-host diversity can be utilized to identify the possible involvement of a long-term carrier, potentially enhanc-ing the infection control response. Combining this with directed multisampling of suspected LTCs (1) may be a cost-effective method of using WGS to ensure that, where HCWs are implicated, potentially career-altering decisions are made using the best possible evidence.

MATERIALS AND METHODS
Comparison of within-host diversity in newly acquired and long-term carriage. Eight participants were identified from a population study of S. aureus nasal carriage in adults attending general practices in Oxfordshire (15), in which participants had nasal swabs taken at two-month intervals, with positive samples stored as mixed glycerol stocks taken by sweeping across multiple colonies on the primary plates to preserve the diversity of carried strains (11). The eight participants were negative for nasal carriage at recruitment and subsequently had consistently negative swabs for Ն6 months before acquiring a strain which they carried for at least 1 year. The first and last positive samples for each individual were retrieved from the mixed glycerol stocks. Samples were plated on Columbia blood agar (CBA) and incubated overnight at 37°C. For each time point, 8 individual colonies (12 for case no. 1218) were selected and further subcultured on a CBA plate and again incubated overnight at 37°C.
We also retrieved sequencing data from 13 participants previously investigated, (9) for whom the approximate time of acquisition was unknown. Each of these had 8 to 12 individual colonies sequenced.
Collection of outbreak isolates and epidemiological data. Nineteen outbreaks were purposively sampled in collaboration with the Public Health England (PHE) staphylococcal reference laboratory, representing a range of sequence types and epidemiological settings and including both MRSA and MSSA. One further outbreak was investigated in conjunction with Lausanne University Hospital, Switzerland (14,16). Epidemiological information was obtained from each infection control team (specimen date, site, ward location, and where applicable, admission/discharge dates and previous screening results).
For each outbreak, additional background isolates were also included for comparison. We sequenced all isolates submitted to PHE as part of the outbreak investigation, including those identified as "nonoutbreak" by routine typing, to estimate the expected genetic diversity of the outbreak strain and to ensure that the apparent outbreak strains were not part of an ongoing clonal expansion. We also included non-epidemiologically linked isolates matched for spa type and/or MLST to provide a comparison for expected within-spa distances and to provide an outgroup for phylogenetic analysis.
Isolates were retrieved from single-colony frozen stocks held at the PHE reference laboratory, Colindale, London, or at Lausanne University Hospital. We used only the first isolate from each case and included isolates both from clinical samples and screening swabs.
Extraction and sequencing. DNA was extracted and sequenced as previously described (6) from a single colony subcultured on CBA and incubated for 18 to 24 h. Sequencing was performed using the Illumina HiSeq or MiSeq platforms.
Genome assembly and construction of phylogenetic trees. For all outbreaks, reads were aligned using Stampy v1.0.17 to a standard reference genome (MRSA252; GenBank no. NC_002952) (17). Six outbreaks were also mapped to clonal complex-specific reference genomes obtained from in-house collections or GenBank. Single nucleotide variants were identified across all mapped nonrepetitive sites using SAMtools v 0.1.18 mpileup, with the extended base-alignment quality flag and masking of mobile genetic elements. A consensus of Ն75% and Ն5 reads, including one in each direction, were required to support an SNV, and calls were required to be homozygous under a diploid model. Maximum likelihood trees were estimated from the mapped whole genomes using PhyML (18).
Outbreak analysis and calculation of TMRCA. The index case was defined as the earliest microbiologically confirmed case in each cluster. Outbreak cases were defined as those sharing related PFGE pulsotypes (19) plus a definite epidemiological link to the index or secondary cases (Ͼ24-h stay in the same ward or household/classroom/similar community situation with prolonged contact, e.g., childcare). For each outbreak case, the genetic distance in SNVs was calculated from the index case and the nearest neighbor. If an isolate was more distant from the index case than the nearest spa/MLST-matched comparator, it was considered sporadic and excluded from further outbreak analysis.
We classified each outbreak according to the possibility of long-term carrier involvement (LTC carrying for Ն6 weeks) as follows: (i) LTC not suspected, direct contact between cases, or no history of preexisting staphylococcal disease; (ii) evidence for a pre-or perioutbreak LTC, either Ն1 case with a history of recurrent staphylococcal disease or nonoverlapping hospital stays (ward case identified after a case-free interval, indicating a possible health care worker carrier); and (iii) evidence of a postoutbreak LTC, Ն1 case with positive nasal swab Ͼ6 weeks after initial swab (indicating a propensity for long-term carriage).
To evaluate the relationship between outbreak diversity and the likelihood of a long-term carrier, we estimated the time to most recent common ancestor (TMRCA) using BEAST v1.8.1 (20). We applied a simple HKY substitution model with constant population size and a standardized substitution rate of 3.3 ϫ 10 Ϫ6 substitutions per genome per year (7) (see supplemental material). To control for differences in outbreak duration, outbreaks were censored at 6 months, and the (censored) outbreak duration was subtracted from the calculated TMRCA to obtain a duration-adjusted TMRCA.
We compared SNV distances between isolates of identical pulsotypes and those differing by one or more band. To determine whether there was an increase in genetic diversity associated with the acquisition of antimicrobial resistance, we also interrogated the predicted antibiograms as previously described (21).
Statistical analyses were performed using Stata v13.1. Mean pairwise differences were modeled using normal linear regression using robust standard errors to account for dependence within person/ outbreak. The ability of TMRCA to differentiate between outbreaks with evidence for an LTC compared with outbreaks with no evidence for an LTC was evaluated using a receiver operating characteristic curve analysis.
Accession number(s). The sequences reported in this paper have been deposited in the NCBI Sequence Read Archive under bioproject number PRJNA380544.