ABSTRACT
Whole-genome sequencing (WGS) has emerged today as an ultimate typing tool to characterize Listeria monocytogenes outbreaks. However, data analysis and interlaboratory comparability of WGS data are still challenging for most public health laboratories. Therefore, we have developed and evaluated a new L. monocytogenes typing scheme based on genome-wide gene-by-gene comparisons (core genome multilocus the sequence typing [cgMLST]) to allow for a unique typing nomenclature. Initially, we determined the breadth of the L. monocytogenes population based on MLST data with a Bayesian approach. Based on the genome sequence data of representative isolates for the whole population, cgMLST target genes were defined and reappraised with 67 L. monocytogenes isolates from two outbreaks and serotype reference strains. The Bayesian population analysis generated five L. monocytogenes groups. Using all available NCBI RefSeq genomes (n = 36) and six additionally sequenced strains, all genetic groups were covered. Pairwise comparisons of these 42 genome sequences resulted in 1,701 cgMLST targets present in all 42 genomes with 100% overlap and ≥90% sequence similarity. Overall, ≥99.1% of the cgMLST targets were present in 67 outbreak and serotype reference strains, underlining the representativeness of the cgMLST scheme. Moreover, cgMLST enabled clustering of outbreak isolates with ≤10 alleles difference and unambiguous separation from unrelated outgroup isolates. In conclusion, the novel cgMLST scheme not only improves outbreak investigations but also enables, due to the availability of the automatically curated cgMLST nomenclature, interlaboratory exchange of data that are crucial, especially for rapid responses during transsectorial outbreaks.
INTRODUCTION
Listeria monocytogenes is a facultative anaerobe, a Gram-positive, psychrophilic and salt-tolerant, facultative intracellular pathogen of humans and animals, causing clinical manifestations like gastroenteritis, encephalitis, meningitis, and septicemia. A high hospitalization rate of >90% and a case-fatality rate up to 30% make L. monocytogenes an important human pathogen (1). The characteristic traits (growth at low temperatures, survival of freezing and high-salt and nitrite preservation methods, and biofilm formation) of L. monocytogenes represent a major issue for industrialized food production and facilitate food contamination at several stages of food production (2). Nearly all cases of listeriosis are caused by consumption or use of contaminated food or feed.
The traditional L. monocytogenes serotyping scheme allows the differentiation of 12 serotypes of which 4b, 1/2a, and 1/2b isolates cause about 96% of all reported human listeriosis cases (3). Low discriminatory power, insufficient reproducibility, and antigen sharing between serotypes impede the value of serotyping in outbreak investigations and necessitate more accurate and more discriminatory typing solutions (4).
Pulsed-field gel electrophoresis (PFGE) has been established as the current “gold standard” for L. monocytogenes typing by PulseNet (5, 6) and has been essential for outbreak investigation worldwide (7). However, PFGE is time-consuming, expensive, and difficult to standardize (8, 9). Methods based on DNA sequence analysis appear more promising for fast, accurate, and reproducible strain typing (10). Whereas multilocus sequence typing (MLST) (11, 12) and multi-virulence-locus sequence typing (MVLST) (13) schemes for L. monocytogenes share the characteristics of sequence-based methods, they both lack the discriminative power needed for outbreak investigation of this clonal pathogen (7, 14).
Nowadays, the recent and ongoing evolution of sequencing technologies from Sanger sequencing to next-generation sequencing enables sequence analysis on a whole-genome level. Several studies on a variety of bacterial species have already shown that whole-genome sequence (WGS)-based typing, based either on single nucleotide variants (SNVs) (15, 16) or on gene-by-gene allelic profiling of core genome genes, frequently named core genome MLST (cgMLST) or MLST+ (17, 18), currently represents the ultimate diagnostic tool for strain typing. Recently, we successfully applied a cgMLST typing approach to L. monocytogenes (19). Nevertheless, the broad use of WGS-based approaches is still hampered by the lack of standardized nomenclature that would facilitate global exchange of data, as has already been the reality for classical MLST data (20) for more than a decade.
To achieve a stable cgMLST scheme for L. monocytogenes that can form the basis of a standardized nomenclature for WGS-based L. monocytogenes typing, first we defined an L. monocytogenes core genome gene set representing the genetic diversity within the L. monocytogenes population based on well-characterized reference strains, and second we challenged this scheme for suitability in outbreak investigations using isolates from two outbreaks and sporadic cases.
MATERIALS AND METHODS
Microorganisms and DNA extraction.All strains and genome sequences used for the development of the novel cgMLST L. monocytogenes scheme are listed in Table 1. For subsequent evaluation of the scheme, a total of 67 L. monocytogenes isolates from sporadic cases (n = 8 isolates, that served also as outgroups for the outbreaks with matching serotypes and highly similar or even identical PFGE pattern) and two outbreaks (n = 42) (21–23) with reference strains for all serotypes (n = 17) were used (Table 2). All strains were cultured overnight at 37°C on RAPID'L.Mono agar (Bio-Rad, Vienna, Austria) for species confirmation and subcultivated on Columbia blood agar plates (bioMérieux, Marcy I'Etoile, France) prior to DNA extraction using the GenElute Bacterial Genomic DNA kit (Sigma, St. Louis, MO, USA) according to the manufacturer's instructions.
List of L. monocytogenes strains and genomes used for SeqSphere cgMLST L. monocytogenes target definition
List of L. monocytogenes isolates used for evaluation of the SeqSphere cgMLST L. monocytogenes schemea
Whole-genome sequencing and assembly.Sequencing libraries were prepared using Nextera XT chemistry (Illumina Inc., San Diego, CA, USA) for a 250-bp paired-end sequencing run on an Illumina MiSeq sequencer. Samples were sequenced to aim for minimum coverage of 100-fold using Illumina's recommended standard protocols. The resulting FASTQ files were first quality trimmed and then de novo assembled using the Velvet assembler (24) integrated in Ridom SeqSphere+ software (25) (version 2.3; Ridom GmbH, Münster, Germany). Here, reads were trimmed at their 5′ and 3′ ends until an average base quality of 30 was reached in a window of 20 bases, and the assembly was performed with Velvet version 1.1.04 using optimized k-mer size and coverage cutoff values based on the average length of contigs with >1,000 bp.
BAPS.To determine the overall L. monocytogenes species variation, we applied a Bayesian analysis of population structure (BAPS) (26, 27). All multilocus sequence typing (MLST) data available as of 24 July 2014 (673 sequence types [STs]) were downloaded from the MLST website (14), and all allelic gene sequences per locus were multiple aligned using MUSCLE (28) and finally concatenated for each ST. The BAPS was carried out using the clustering of linked molecular data functionality. Ten runs were performed, setting an upper limit of 20 partitions. Admixture analysis was performed using the following parameters: minimum population size considered, 5; iterations, 50; number of reference individuals simulated from each population, 50; and number of iterations for each reference individual, 10.
cgMLST target gene definition.To determine the cgMLST gene set (named MLST+ in the SeqSphere+ software), a genome-wide gene-by-gene comparison was performed using the MLST+ target definer (version 1.1) function of SeqSphere+ with default parameters. These parameters comprise the following filters to exclude certain genes of the EGD-e reference genome (GenBank accession number NC_003210.1, dated 26 March 2015) from the cgMLST scheme: a minimum length filter that discards all genes shorter than 50 bp; a start codon filter that discards all genes that contain no start codon at the beginning of the gene; a stop codon filter that discards all genes that contain no stop codon or more than one stop codon or that do not have the stop codon at the end of the gene; a homologous gene filter that discards all genes with fragments that occur in multiple copies within a genome (with identity of 90% and >100 bp overlap); and a gene overlap filter that discards the shorter gene from the cgMLST scheme if the two genes affected overlap >4 bp. The remaining genes were then used in a pairwise comparison with BLAST version 2.2.12 (parameters used were word size 11, mismatch penalty −1, match reward 1, gap open costs 5, and gap extension costs 2) with the query L. monocytogenes chromosomes. All genes of the reference genome that were common in all query genomes with a sequence identity of ≥90% and 100% overlap and, with the default parameter stop codon percentage filter turned on, formed the final cgMLST scheme; this discards all genes that have internal stop codons in >20% of the query genomes.
Evaluation of the cgMLST target gene set.To evaluate the applicability and representativeness of the L. monocytogenes cgMLST target gene set, a total of 67 isolates (Table 2) were subsequently analyzed to determine the presence of these target genes. It was assumed that a well-defined cgMLST scheme should cover at least 95% of the cgMLST genes present in all isolates.
To extract the target genes, the default parameters were used in the SeqSphere+ software: (i) for processing options, “Ignore contigs shorter than 200 bases”; (ii) for scanning options, “Matching scanning thresholds for creating targets from assembled genomes” with “required identity to reference sequence of 90%” and “required alignment to reference sequence with 100%”; and (iii) for BLAST options, word size 11, mismatch penalty −1, match reward 1, gap open costs 5, and gap extension costs 2. In addition, the target genes were assessed for quality, i.e., the absence of frame shifts and ambiguous nucleotides. A core genome gene was considered a “good target” only if all of the above criteria were met, in which case the complete sequence was analyzed in comparison to the EGD-e reference. Alleles for each gene were assigned automatically by the SeqSphere+ software to ensure unique nomenclature. The combination of all alleles in each strain formed an allelic profile that was used to generate minimum spanning trees (MST) using the parameter “pairwise ignore missing values” during distance calculation.
In order to maintain backwards compatibility with classical L. monocytogenes MLST, sequences of the seven genes comprising the allelic profile of the MLST scheme were extracted separately from the genome sequences and queried against the L. monocytogenes MLST database in order to assign classical STs in silico.
Nucleotide sequence accession number.All raw reads generated were submitted to the European Nucleotide Archive (http://www.ebi.ac.uk/ena/) under the study accession number PRJEB6551.
RESULTS
BAPS partition and admixture analysis based on 673 STs resulted in seven partitions (see Table S1 in the supplemental material). As BAPS partitions Lm05 and Lm06 comprised exclusively Listeria innocua species isolates of 43 STs with significant admixtures, these two partitions were excluded from further analysis. For the remaining five partitions, three (partitions Lm01, Lm02, and Lm04) were among the available NCBI RefSeq genome sequences of L. monocytogenes. To achieve complete coverage of the L. monocytogenes population, we sequenced six additional strains from Seeliger's Listeria culture collection (SLCC0717, SLCC0759, SLCC1042, SLCC3287, SLCC4771, and SLCC6263), representing the missing BAPS partitions Lm03 and Lm07 (Table 1). In total, 42 genome sequences, including L. monocytogenes strain EDG-e as reference for core genome gene definition were fed into the MLST+ target definer and resulted in 1,701 genes out of 2,867 genes of strain EGD-e (53.2% of the EDG-e strain chromosome nucleotides) (see Table S2 in the supplemental material).The cgMLST scheme was then challenged with two sets of strains: the first contained 17 serotype reference strains representing all serotypes, genetic lineages, and BAPS partitions to determine its ability to cover the whole L. monocytogenes diversity and the second consisted of 48 isolates from two published outbreaks, including eight outgroup isolates (Table 2). All 17 serotype reference strains had ≥99.1% good cgMLST targets (mean, 99.5%), and for all serotype representatives the correct MLST was obtained. Similarly, for the two outbreaks, all isolates had ≥99.3% good cgMLST targets (mean, 99.7%). The results are summarized in Table 2.
The cgMLST scheme was further evaluated for its usability in outbreak investigation, i.e., whether outbreak isolates could be attributed to the same clone, named cluster type (CT) in the context of cgMLST typing, and clearly separated from the outgroup isolates. Therefore, we determined the maximum number of differing genes within each outbreak that reflect putative microevolutionary events. To facilitate cluster investigations in the future, we finally defined the so-called CT threshold that gives the maximum number of differing alleles that are shared by the same CT. In the two retrospectively analyzed outbreaks, a jellied pork outbreak (JPO) in Austria in the year 2008 and two epidemiologically linked clusters forming the acid curd cheese (Quargel) outbreak (ACCO) in Austria, the Czech Republic, and Germany in the years 2009/2010 (Table 2), detailed analysis resulted in a maximum number of 10 differing alleles (see Table S3 in the supplemental material). cgMLST of seven human and two food isolates from the JPO correctly grouped these isolates together with a maximum of four allelic differences (Fig. 1). Outgroup isolates L2708 (ST249) and L7508 (ST4) exhibited more than 1,000 allelic differences, and reference strains F2365 and LL195 (both ST1) exhibited ≥32 allelic differences (Fig. 1). Extraction of classical MLST targets resulted in STs of all outbreak isolates that were identical to those of ST1 and confirmed the previous Sanger sequencing (Table 2).
Minimum-spanning tree based on cgMLST allelic profiles of 9 L. monocytogenes isolates (all share ST1) from the jellied pork outbreak (21) and two outgroup isolates L2708 (ST249) and L7508 (ST4) in comparison to reference strains F2365 (GenBank accession number NC_002973) and LL195 (NC_019556) (both ST1) exhibiting the same serotype 4b. Each circle represents an allelic profile based on sequence analysis of 1,701 cgMLST target genes. The numbers on the connecting lines illustrate the numbers of target genes with differing alleles. The different groups of strains are distinguished by the colors of the circles. Closely related genotypes (≤10 allele difference) are shaded in gray. NCBI RefSeq strains are marked with an asterisk.
cgMLST of 33 isolates from the ACCO correctly identified the two different clones (ACCO I and ACCO II) that caused this outbreak (Fig. 2). Within the ACCO I clone, nine isolates were ST403 and five were ST777, a single locus bglA variant of ST403. cgMLST revealed the same dichotomy as classical MLST; the right branch of the ACCO I tree comprised all ST777 isolates (L27-09, L31-09, L32-09, L35-09, and L68-09). All outgroup isolates (MRL-13-00230, LD27-12, and Ro-015) were ST403 with at least 16 allelic differences compared to the ACCO I isolates. ACCO I isolates displayed a maximum of 10 allelic differences from each other (Fig. 2). ACCO II isolates had a maximum of two allelic differences from each other. All ACCO II isolates were correctly assigned to ST398. The three epidemiologically unrelated outgroup isolates (L38-11, 12025641, and 12025647) with an identical PFGE band pattern (data not shown) also exhibited ST398 and had ≥23 allelic differences compared to the ACCO II food isolates (Fig. 2). ACCO I and ACCO II isolates differed in >1,000 alleles from each other (Fig. 2).
Minimum-spanning tree illustrating the phylogenetic relationship based on the cgMLST allelic profiles of 33 L. monocytogenes isolates from the outbreak associated with acid curd cheese (ACCO) (22, 23) consisting of two clones (ACCO I and ACCO II). Three outgroup isolates per outbreak (with identical PFGE profiles and serotypes) are shown in comparison to the reference strain EGD-e (GenBank accession number NC_003210; ST35). ACCO I isolates L27-09, L31-09, L32-09, L35-09, and L68-09 were ST777; the remaining isolates, including the three ACCO I outgroup isolates were ST403. ACCO II isolates, including the three ACCO II outgroup isolates were all ST398. Each circle represents an allelic profile based on sequence analysis of 1,701 genes. The numbers on the connecting lines illustrate the numbers of target genes with differing alleles. The different groups of strains are distinguished by the colors of the circles. Closely related genotypes (≤10 allele difference) are shaded in gray. The NCBI RefSeq strain is marked with an asterisk.
DISCUSSION
In outbreak situations, a rapid, accurate, and standardized classification of bacterial isolates is essential. Since its introduction in 1998, MLST has become a proof-of-principle method for sequence-based typing methods with a unique centrally curated and thereby standardized nomenclature (20). Building on these experiences nowadays, it is possible to analyze thousands of genes using next-generation sequencing, which dramatically increases discriminatory power and thereby now enables outbreak investigations (18, 19, 29–32). In our study, we were able not only to show that our cgMLST typing scheme is representative for the breadth of the L. monocytogenes population with ≥99.1% successfully extracted cgMLST targets but also to differentiate outbreak from nonoutbreak isolates clearly.
The microevolutionary events within each outbreak and the CT threshold of ≤10 differences warrant further comments. Within the first outbreak, the JPO (21), very few allelic changes were detected and the maximum allelic distance within the outbreak was only four alleles. This high similarity reflects the outbreak situation without much time for intraoutbreak microevolution, because all patients belonged to one travel group and became ill after consuming contaminated jellied pork at an Austrian tavern (21). The ACCO cluster of listeriosis occurred from 2009 until 2010 in Austria, Germany, and the Czech Republic and was caused by contaminated acid curd cheese (Quargel) (22, 23). Further epidemiological and molecular outbreak investigations revealed that two different serotype 1/2a clones with distinct PFGE patterns and inlB STs were responsible for this outbreak (33). Interestingly, a recent study focusing on the comparative genomics of the two outbreak clones revealed significant differences in virulence (34). Again, cgMLST analysis corroborated these findings, and the number of differing alleles among the outbreak clones again reflected the outbreak length. Whereas the ACCO I isolates were found over a period of 8 months and up to 10 different alleles were detected; isolates of the ACCO II were found only during a 3-month period, where at maximum two different alleles were recorded. Therefore, we assume that ACCO I isolates are a representative microevolutionary model for the CT threshold determination to facilitate outbreak investigations using cgMLST. Although the software supports outbreak investigation by providing the CT, this does not release the epidemiologist from thorough investigation.
MLST and cgMLST both use alleles and not nucleotide polymorphisms as units of comparison. Irrespective of the number of nucleotide polymorphisms involved, each allelic change is numbered as a single event; i.e., an allelic change is related to at least one point mutation but can also contain several nucleotide changes. This principle covers the conflicting signals of horizontal and vertical transfer of genetic material and considers the higher frequency of recombination than point mutations in bacteria (30, 32). One major advantage of such an allele-based approach is easy storage and curating the nomenclature in a central database, which is obligatory to guarantee universal nomenclature. For classical MLST, this scenario was one of the key factors to success. However, manual curation of the current MLST databases frequently hampers the rapid use of novel allelic sequences as human intervention is necessary to assign new alleles and STs. With the software solution used here, it was already possible to automatically assign novel cgMLST alleles, after dedicated quality control of the read and assembly data. This automation is crucial as the vast amount of sequencing data is not humanly readable anymore in a reasonable time frame that is needed for effective implementation of hygiene measures during outbreaks. The immediate and automated assignment of novel alleles also enables any software user to access identical nomenclature for L. monocytogenes cgMLST typing, a prerequisite for successful interlaboratory exchange of data. In the future, it is desirable to have an open Internet-based nomenclature server that is able to be interrogated by any software or user (35). The SpaServer (http://spaserver.ridom.de), which automatically hosts the nomenclature of the Staphylococcus aureus protein A gene typing (spa) and now contains >300,000 typing entries originating from >100 countries, might serve as a blueprint for such service (36).
Our approach has one limitation. The analysis is reduced to coding regions only because the second-generation sequencing instruments currently in use produce only relatively short reads that do not assemble the frequently highly repetitive intergenic regions well, leading to faulty assemblies. Therefore, when second-generation sequencing machines are used, focusing on coding regions helps to improve the analytical quality. This might change when third-generation sequencing instruments that produce much longer reads from a single molecule are widely available, preferably as benchtop systems. Nevertheless, the current cgMLST approach will be sustainable as it will maintain backward compatibility with expansion of typing schemes to present typing as we see today with the in silico extraction of classical MLST STs from WGS data.
In conclusion, we established a highly representative cgMLST scheme for WGS-based typing of L. monocytogenes and demonstrated both a high discriminatory power and concordance to previous findings in different outbreak scenarios. The remaining challenge is to establish an Internet-based nomenclature server that can be interrogated like the current MLST servers to facilitate universal global nomenclature for any user.
ACKNOWLEDGMENTS
This work was funded by the European Community's Seventh Framework Program (grant FP7/2007-2013 to D.H.) under grant agreement 278864 in the framework of the EU PathoNGenTrace project, by the Medical Faculty of the University of Münster (grant BD9817044 to AM), and by a Leonardo da Vinci Global Training grant to H.L.F.
D.H. is codeveloper of the Ridom SeqSphere+ software mentioned in the paper, a development of Ridom GmbH (Münster, Germany), which is partially owned by him. The other authors declare no conflicts of interest.
FOOTNOTES
- Received 4 May 2015.
- Returned for modification 5 June 2015.
- Accepted 13 June 2015.
- Accepted manuscript posted online 1 July 2015.
Supplemental material for this article may be found at http://dx.doi.org/10.1128/JCM.01193-15.
- Copyright © 2015 Ruppitsch et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 Unported license, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.