Consensus Sequence-Based Scheme for Epidemiological Typing of Clinical and Environmental Isolates of Legionella pneumophila

ABSTRACT A previously described sequence-based epidemiological typing method for clinical and environmental isolates of Legionella pneumophila serogroup 1 was extended by the investigation of three additional gene targets and modification of one of the previous targets. Excellent typeability, reproducibility, and epidemiological concordance were determined for isolates belonging to both serogroup 1 and the other serogroups investigated. Gene fragments were amplified from genomic DNA, and PCR amplicons were sequenced by using forward and reverse primers. Consensus sequences are entered into an online database, which allows the assignment of individual allele numbers. The resulting sequence-based type or allelic profile comprises a string of the individual allele numbers separated by commas, e.g., 1,4,3,1,1,1, in a predetermined order, i.e., flaA, pilE, asd, mip, mompS, and proA. The index of discrimination (D) obtained with these six loci was calculated following analysis of a panel of 79 unrelated clinical isolates. A D value of >0.94 was obtained, and this value appears to be sufficient for use in the epidemiological investigation of outbreaks caused by L. pneumophila. The D value rose to 0.98 when the results of the analysis were combined with those of monoclonal antibody subgrouping. Sequence-based typing of L. pneumophila is epidemiologically concordant and discriminatory, and the data are easily transportable. This consensus method will assist in the epidemiological investigation of L. pneumophila infections, especially travel-associated cases, by which it will allow a rapid comparison of isolates obtained in more than one country.

Many phenotypic and genotypic methods have been applied to the epidemiological typing of Legionella pneumophila (4, 12, 15, 21, 23, 28, 29, 34-36), the principal cause of the majority of cases of legionellosis (22). Members of the European Working Group for Legionella Infections (EWGLI) have previously evaluated a number of genotypic methods for the epidemiological typing of L. pneumophila, which is particularly applicable to cases of travel-associated legionellosis. The aim of such studies has been to achieve a standardized protocol that allows the exchange of typing data rather than strains across countries and borders, the latter of which is increasingly difficult, costly, and time-consuming (8-10). One of these methods, a single-endonuclease, amplified fragment length polymorphism analysis method by which the patterns are resolved by standard agarose electrophoresis (8, 34), was adopted as an international standard (9) and is widely used by the members of EWGLI. However, while this method allows relatively rapid screening of isolates within a single laboratory, interlaboratory results from comparison of the profiles contained in a web-enabled identification library database revealed that a significant proportion of laboratories could not achieve correct identification 100% of the time (8; B. Afshar, N. K. Fry, and T. G. Harrison, Abstr. 19th Annu. Meet. Eur. Working Group Legionella Infections, Chamonix, France, 2004). Therefore, we sought to develop a simple, rapid, and discriminatory typing method which would be truly portable to aid with the investigation of legionellosis and characterization of L. pneumophila isolates.
The aims of this study were to (i) seek to improve the level of discrimination by investigation of the additional genes, asd, mip, and pilE, and to modify the protocol for one of the previous gene targets (mompS) and (ii) formally describe a consensus sequence-based epidemiological typing scheme and establish an online database which would allow the ready characterization of clinical and environmental isolates of L. pneumophila.

Bacterial strains. A total of 105 clinical and environmental isolates of L. pneumophila were analyzed, including 96 L. pneumophila sg 1 isolates and 9 L. pneumophila isolates from other serogroups. Ninety-five of the L. pneumophila sg 1 isolates from nine European countries were selected to produce one epidemiologically unrelated panel of 79 clinical isolates (panel 1), one epidemiologically related panel of 16 isolates (panel 2), and one stability panel of five isolates (panel 3). Each of these isolates was obtained from the EWGLI Legionella culture collection and has a unique number (European Union Legionella culture collection number; see http://www.ewgli.org). This collection of isolates was established by members of the EWGLI to facilitate epidemiological typing studies. Each isolate has previously been extensively characterized, and details for these isolates have been described previously (8-11). The epidemiologically related panel (panel 2) comprised five sets of epidemiologically related isolates and two replicates of the same isolate. The stability panel (panel 3) comprised five variants of the same strain.

A number of clinical and environmental non-sg 1 isolates of L. pneumophila recovered during the course of outbreak investigations were also investigated, including some belonging to sg 6 (n = 2), sg 8 (n = 2), and sg 10 (n = 2), together with a number of reference strains for sg 1 (3), sg 6 (16, 24), sg 8 (1), and sg 10 (25) obtained from the National Collection of Type Cultures, London, United Kingdom.
Strategy. Six gene targets were investigated: asd, flaA, mip, mompS, pilE, and proA. Preliminary data have previously been described for flaA, proA, and mompS (11). For this study, evaluation was undertaken as outlined in the Consensus Guidelines of the European Study Group on Epidemiological Markers (32) and previous studies (8-11). Typeability (T) was calculated as the proportion of isolates assigned to a type by sequencing of each target gene. Reproducibility (R) was calculated by analysis of two to six loci from the 16 panel 2 isolates in at least two centers by using a number of different conditions (including the use of reagents from different manufacturers) for DNA extraction, PCR, and DNA sequencing. Epidemiological concordance (E) was expressed as the number of epidemiologically related sets of strains found to be indistinguishable by the typing system divided by the total number of such sets in the panel. Estimates of the indices of discriminatory power (D) were determined by using Hunter and Gaston's (20) modification of Simpson's index of diversity (30, 32). If the sample consists of completely unique subtypes, then the value of the index is 1 (maximum value), while a value of 0 (minimum value) occurs when all isolates have an identical subtype. Due to the sampling process, the calculated diversity indices are estimates of the unknown true value for the population from which the sample was taken. Therefore, confidence intervals were calculated to convey the precision of these point estimates by using the variance formula originally described by Simpson (30), in which σ is the standard deviation, N is the total number of isolates, and x_i is the number of isolates in the ith category.
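Written out, the index and its large-sample variance take the following commonly cited forms. This is a sketch that follows the symbol definitions above; the upper summation limit S (the number of distinct types observed) is notation assumed here, and the variance is the form usually quoted in the typing literature rather than a transcription from this paper.

```latex
% Hunter and Gaston's index of discrimination
D = 1 - \frac{1}{N(N-1)} \sum_{i=1}^{S} x_i \,(x_i - 1)

% Large-sample variance of the index (x_i / N is the observed frequency of type i)
\sigma^2 = \frac{4}{N}\left[\, \sum_{i=1}^{S} \left(\frac{x_i}{N}\right)^{3}
           - \left( \sum_{i=1}^{S} \left(\frac{x_i}{N}\right)^{2} \right)^{2} \right]
```

With every isolate in its own type (all x_i = 1) the sum in D vanishes and D = 1; with a single type (x_1 = N) the sum equals N(N-1) and D = 0, matching the limits stated above.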
Approximate 95% confidence intervals (CIs) were then calculated by using the usual formula. Individual and combined D values and 95% confidence intervals were calculated by using the allelic profiles of the isolates analyzed in this study (as described below), together with data from flaA, proA, and monoclonal antibody (MAb) subgrouping results from previous studies (8, 10, 11). The stability of each gene was assessed by analysis of the five variants of the same strain (panel 3).
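The calculation of D and an approximate confidence interval can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the variance is the large-sample form commonly quoted for Simpson-type indices, and the multiplier of 1.96 (a normal approximation) is an assumption, since some authors use 2 instead.

```python
from collections import Counter
from math import sqrt


def discrimination_index(types):
    """Hunter-Gaston index D for a list of type labels, one label per isolate."""
    n = len(types)
    if n < 2:
        raise ValueError("need at least two isolates")
    counts = Counter(types).values()
    return 1.0 - sum(x * (x - 1) for x in counts) / (n * (n - 1))


def approximate_ci(types, z=1.96):
    """Approximate confidence interval D +/- z*sigma, clipped to [0, 1].

    Assumption: sigma^2 = (4/N) * (sum(p^3) - (sum(p^2))^2), with p = x_i / N.
    """
    n = len(types)
    freqs = [x / n for x in Counter(types).values()]
    variance = (4.0 / n) * (sum(p ** 3 for p in freqs) - sum(p ** 2 for p in freqs) ** 2)
    sigma = sqrt(max(variance, 0.0))  # guard against tiny negative rounding error
    d = discrimination_index(types)
    return max(0.0, d - z * sigma), min(1.0, d + z * sigma)


# Hypothetical allelic profiles for four isolates (not data from this study).
profiles = ["1,4,3", "1,4,3", "2,6,1", "3,8,5"]
d = discrimination_index(profiles)
low, high = approximate_ci(profiles)
```

Each isolate contributes one label; whether that label is a single allele number, a full allelic profile, or a profile combined with a MAb subgroup determines which D value is being estimated.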
Sequence-based typing was performed essentially as described previously (11), with the following modifications and additions. Since the previous study, optimization of the amplification and sequencing conditions for the mompS gene target has resulted in improved sequence quality. Therefore, these new conditions were used in this study. Three new gene targets, asd, mip, and pilE, were also investigated.
DNA extraction, PCR amplification, and DNA sequencing. Genomic DNA was extracted as described previously (11) or by emulsifying two colonies of L. pneumophila in 0.5 ml sterile water and heating for 8 min at 100°C. Oligonucleotide primers targeting regions of each of the genes were used to amplify 245- to 648-bp products encompassing regions of variation (Table 1). The same primers were used for amplification and sequencing of all targets except mompS (amplification primers, mompS-492F and mompS-1116R; sequencing primers, mompS-492F and mompS-1015R). Amplification by PCR was performed as described previously (11) or with 5 to 10 µl of lysate. Amplification was performed by using the following conditions: 35 cycles of denaturation for 30 s at 94°C; annealing for 30 s at 50°C (for flaA, mompS, and proA), 55°C (for pilE), 60°C (for mip), or 62°C (for asd); and elongation for 30 s at 72°C. In order to establish whether the same program (annealing temperature) could be used for the primary amplification of all loci, a range of annealing temperatures, 50° to 65°C (for all loci), was evaluated by visualization of the primary amplicons, together with analysis of the sequence data generated from these amplicons.
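For laboratories that keep their cycling programs in software, the locus-specific conditions above can be captured as a small lookup. The sketch below only restates the temperatures and times given in the text; the names ANNEALING_C and cycling_program are hypothetical, and no initial denaturation or final extension step is included because none is stated here.

```python
# Annealing temperature (degrees C) for each locus, as listed in the text.
ANNEALING_C = {
    "flaA": 50,
    "mompS": 50,
    "proA": 50,
    "pilE": 55,
    "mip": 60,
    "asd": 62,
}


def cycling_program(locus, cycles=35):
    """Return (steps, cycles); each step is (name, temperature in C, seconds)."""
    anneal = ANNEALING_C[locus]  # raises KeyError for an unknown locus
    steps = [
        ("denaturation", 94, 30),
        ("annealing", anneal, 30),
        ("elongation", 72, 30),
    ]
    return steps, cycles


steps, cycles = cycling_program("mip")
```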
Purification of amplicons was performed as described previously (11) or with Montage PCR 96 filter plates (Millipore). The sequencing products were analyzed as described above or on a model 3730XL ABI DNA sequencer (Applied Biosystems), following the instructions supplied by the manufacturer.
Sequence analysis. Sequence analyses were performed locally by using the programs described previously and Chromas (Technelysium Pty Ltd., Australia; http://www.technelysium.com.au), Readseq, SeaView (13), Sequencher (Gene Codes Corporation), Autoassembler, or Sequence Navigator (Applied Biosystems). For all analyses the data obtained with the forward and reverse sequencing primers were combined and aligned manually to produce a consensus sequence. Consensus sequences trimmed to the correct length were then identified by comparison with the sequences of preexisting alleles by using online tools. Chromatogram traces from putative novel allele types were submitted to the coordinating center for verification. The genes, the reference sequences, the fragment sizes of the primary PCR products and regions used in the analysis, and the number of alleles found in this study are shown in Table 2. Final analysis of the complete data set was performed by the coordinating center with BioNumerics software (Applied Maths).

Nomenclature and description of allelic profiles. The major outer membrane gene of L. pneumophila was originally described by Hoffman and colleagues (19) and was designated ompS. The previous study (11) and the present study were based on a distinct major outer membrane protein precursor gene designated mompS (GenBank accession no. AF078136; Cristoph and Ehret, unpublished). For this study, the notation mompS is used. As in the previous study (11), for each gene of L. pneumophila analyzed, identical sequences were assigned to the same allele number, e.g., flaA(1), and different sequences were assigned distinct allele numbers, e.g., flaA(1), flaA(2), flaA(3), etc. For each isolate, the combination of alleles at each of the loci was defined as the allelic profile or sequence-based type (SBT) by using a predetermined order, i.e., flaA, pilE, asd, mip, mompS, and proA. For example, for strain EUL 120, the allelic profile (or SBT) is 4,7,11,3,11,12.
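The allele assignment rule (identical sequences share a number, any difference means a different allele) amounts to an exact-match lookup on the trimmed consensus sequence. The following Python sketch illustrates the idea under simplifying assumptions: both reads are taken to be already trimmed to the same region, the consensus is accepted only when the forward read and the reverse-complemented reverse read agree exactly (the study combined and aligned reads manually), and the sequences and function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def consensus(forward_read, reverse_read):
    """Return the consensus if both reads agree exactly, otherwise None."""
    fwd = forward_read.upper()
    rev = reverse_complement(reverse_read)
    return fwd if fwd == rev else None


def assign_allele(sequence, known_alleles):
    """Look up an allele number by exact sequence match.

    known_alleles maps sequence -> allele number for one locus. A result of
    None marks a putative novel allele that would need verification.
    """
    return known_alleles.get(sequence)


# Hypothetical toy alleles for one locus (real alleles are 182 to 473 bp long).
known = {"AACGTT": 1, "AACGTA": 2}
seq = consensus("AACGTA", "TACGTT")  # reverse read as sequenced, 5' to 3'
allele = assign_allele(seq, known)
```

In this toy case the reverse read is the exact reverse complement of the forward read, so a consensus is returned and matches allele 2; a sequence absent from the lookup would return None.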
If an individual allele number has not been determined, a 0 is entered into the allelic profile, thus maintaining its integrity. For example, if the proA allele number was not determined for the example above, the profile would be 4,7,11,3,11,0; and if the mompS allele was not determined, the profile would be 4,7,11,3,0,12. This format allows the sequential addition of future gene targets, subject to appropriate evaluation and consensus agreement. Thus, a novel gene target, x, would contribute a seventh allele number, and all profiles would then include seven numbers. The authors also took the decision not to designate single-number sequence types representing the complete allelic profile at this time. Assignment of new allele numbers is made by the coordinating center, which also curates the online SBT database. In order to designate alleles, the consensus text sequence and the forward and reverse chromatogram files from putative new alleles are examined, and the new allele number is added to the database, subject to verification by at least two people. In this study the sequences of all allele numbers for all loci were confirmed in at least two of the participating centers.
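The profile format described above is simple enough to express directly. The sketch below builds the comma-separated profile in the predetermined locus order and enters 0 for any undetermined allele; the names LOCUS_ORDER and allelic_profile are hypothetical, and the allele numbers are those given for strain EUL 120 in the text.

```python
# Predetermined locus order for the allelic profile (sequence-based type).
LOCUS_ORDER = ("flaA", "pilE", "asd", "mip", "mompS", "proA")


def allelic_profile(alleles, order=LOCUS_ORDER):
    """Format allele numbers as a profile string; missing or None becomes 0."""
    return ",".join(str(alleles.get(locus) or 0) for locus in order)


eul_120 = {"flaA": 4, "pilE": 7, "asd": 11, "mip": 3, "mompS": 11, "proA": 12}
full = allelic_profile(eul_120)  # "4,7,11,3,11,12"

# Same isolate with the proA allele not determined.
partial = allelic_profile({k: v for k, v in eul_120.items() if k != "proA"})
```

Keeping the order in a single tuple means that adding a future seventh target only requires appending its name, after which every profile carries seven numbers.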
Nucleotide sequences. The sequences from the isolates of L. pneumophila described in this study are available from the authors or at http://www.ewgli.org.

Typeability.
All isolates included in the study yielded PCR products of the expected size and DNA sequences with primers specific for all genes when they were tested by all four centers. For each isolate, the alleles either could be assigned to a preexisting allele number or could be identified as novel alleles in the course of this study; thus, T was equal to 1.0.
Sequence variation. The number of nucleotides from each of the gene targets included in the analysis ranged from 182 to 473 bp (Table 2). The numbers of allele types and polymorphic nucleotide sites and the percentage of nucleotide substitutions within each gene locus from the analysis of the panel of 79 strains are shown in Table 2. The highest percentage of polymorphic sites was found in the mompS gene (13.8%), and the lowest was found in the mip gene (3.0%).
Reproducibility. Data from consensus sequences from six loci (flaA, pilE, asd, mip, mompS, and proA) from all 16 isolates tested by the laboratories by using different methods of DNA extraction, PCR cycling conditions, and DNA sequencing platforms were in complete agreement (R = 1.00). By using the cycling conditions described with an annealing temperature of 55°C, primary amplicons suitable for DNA sequence analysis and good-quality sequence data were obtained from all loci.
Epidemiological concordance. All six of the sets of isolates included in the epidemiologically related panels of isolates had compelling evidence of epidemiological relatedness (8), and previous analyses revealed concordant MAb subgrouping, restriction fragment length polymorphism analysis, restriction enzyme analysis, amplified fragment length polymorphism analysis, and sequence-based typing results (8, 10). Six genes were sequenced for the 16 isolates representing the six sets of epidemiologically related strains and two replicates of the same strain. The epidemiologic concordance (E) was calculated for each of the genes analyzed by using this set of 16 isolates, and for each locus E was equal to 1.00. The mip gene and the proA gene could differentiate only five of the six sets of strains, whereas all of the other targets could distinguish all six sets. The epidemiologically related sets that comprised clinical and environmental isolates belonging to serogroups other than sg 1 were also epidemiologically concordant (E = 1.00), suggesting epidemiological linkage, and gave unique profiles for each set (Table 4). The allelic profiles of the unrelated reference strains from sg 1, sg 6, sg 8, and sg 10 are also shown in Table 4 and were distinct from those of the epidemiologically related sets belonging to the same serogroup.

Table footnote a: With respect to the reference sequence (see Table 3).
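Epidemiological concordance as defined in this study is a simple proportion, which the following Python sketch makes explicit. The function name and the example sets are hypothetical and are not data from the panels; a set counts as concordant only when every isolate in it receives the same type.

```python
def epidemiological_concordance(related_sets):
    """E = (related sets whose isolates all share one type) / (number of sets).

    related_sets is a list of sets; each set is a list of type labels
    (for example allele numbers or profile strings), one per isolate.
    """
    if not related_sets:
        raise ValueError("need at least one epidemiologically related set")
    concordant = sum(1 for types in related_sets if len(set(types)) == 1)
    return concordant / len(related_sets)


# Hypothetical example: three related sets typed at a single locus.
sets_at_one_locus = [[3, 3, 3], [7, 7], [3, 3]]
e_value = epidemiological_concordance(sets_at_one_locus)  # 1.0

# The number of distinct types across sets shows how many sets the locus separates.
distinct_between_sets = len({types[0] for types in sets_at_one_locus})  # 2
```

The last line illustrates why a locus can reach E = 1.00 and still fail to separate every set: concordance measures agreement within sets, not differences between them.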

Indices of discrimination.
Estimates of individual indices of discrimination (D) obtained with the six genes, flaA, pilE, asd, mip, mompS, and proA, and various combinations of two or more alleles were determined by using the panel of 79 unrelated clinical isolates (Table 5). Individual-locus D values ranged from 0.767 (proA) to 0.848 (mompS), and maximum discrimination (i.e., D = 0.943) was achieved by using all six loci. By using the combination of the three loci flaA, pilE, and asd and further combinations obtained by sequential addition of the remaining three loci, the resulting indices were almost identical, i.e., D = ~0.94. The upper 95% confidence limits of these indices were all above 0.95, indicating that the true index of diversity could be above 0.95, and the lower 95% confidence limits were slightly above 0.9. When SBT data for all six loci were combined with MAb subgrouping data, the estimated index of diversity was 0.981 and the lower 95% confidence limit was 0.967, indicating that the true index of diversity was unlikely to be below 0.95. If just the combination of flaA, pilE, asd, and MAb types was used, the estimated index of diversity was 0.978 and the lower 95% confidence limit was 0.966, again indicating that the true index of diversity was unlikely to be below 0.95 for this combination.
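Combining markers means treating the tuple of results for each isolate as its type, so that two isolates are indistinguishable only if they agree at every marker used. The sketch below shows this for a made-up set of four isolates; the data, the marker labels mab_A and mab_B, and the function names are hypothetical and chosen only to illustrate how D can rise as markers are added.

```python
from collections import Counter


def hunter_gaston(types):
    """Hunter-Gaston index of discrimination for a list of hashable types."""
    n = len(types)
    counts = Counter(types).values()
    return 1.0 - sum(x * (x - 1) for x in counts) / (n * (n - 1))


def combined_types(isolates, markers):
    """Build one composite type per isolate from the chosen markers."""
    return [tuple(isolate[m] for m in markers) for isolate in isolates]


# Hypothetical isolates (not data from the panel of 79).
isolates = [
    {"flaA": 1, "pilE": 4, "MAb": "mab_A"},
    {"flaA": 1, "pilE": 4, "MAb": "mab_B"},
    {"flaA": 1, "pilE": 6, "MAb": "mab_B"},
    {"flaA": 2, "pilE": 6, "MAb": "mab_B"},
]

d_one = hunter_gaston(combined_types(isolates, ["flaA"]))
d_two = hunter_gaston(combined_types(isolates, ["flaA", "pilE"]))
d_all = hunter_gaston(combined_types(isolates, ["flaA", "pilE", "MAb"]))
```

Adding a marker can only split existing types, never merge them, so the combined index cannot fall below that of any subset of the markers; whether it rises appreciably depends on how much independent variation the extra marker contributes.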

DISCUSSION
Sequence-based typing of L. pneumophila, first applied to strains belonging to sg 1 by Gaia and colleagues (11), demonstrated the potential application of this technique to the investigation of outbreaks of legionellosis. In this study we present data validating the method for three additional genes and demonstrating the application of this approach to other serogroups of L. pneumophila.
The primary aim of this study was to seek to improve the level of discrimination offered by investigation of the additional genes, asd, mip, and pilE, and the robustness of the data from one of the previous targets, mompS. Indices of discrimination for single and multiple gene loci (with and without MAb data) were calculated by using a well-defined panel of isolates, allowing ready comparison. Various publications report that D values of 0.90 and above are desirable, if the typing results are to be interpreted with confidence (20), or report that D values of 0.95 and above are ideal for use in a typing system (32). However, these values assume that the sample used is representative of the total number of subtypes in the population of interest. Although data from environmental isolates of L. pneumophila are limited, it appears from previous studies that clinical isolates of L. pneumophila show less genotypic variation than that found in environmental isolates, which have not been associated with human infections (8, 11). As the panel of 79 isolates used to calculate the indices of discrimination was composed entirely of clinical isolates belonging to sg 1, albeit from 10 different European countries, the D values generated probably represent an underestimate of the true discrimination that would be obtained by sampling a larger range of clinical and environmental isolates from different geographical regions and of other serogroups. However, use of the same panel of strains in multicenter studies has facilitated the calculation and comparison of the relative discriminatory powers of different typing systems for the discrimination of L. pneumophila sg 1 isolates.

TABLE 4. Allelic profiles of L. pneumophila isolates belonging to serogroups other than sg 1 from three epidemiologically related sets and four unrelated reference strains
The use of only three loci, flaA, pilE, and asd, gave an index of discrimination of 0.94; further combinations obtained by the inclusion of mip, mompS, and proA did not appear to offer significantly higher values. However, as the lower 95% confidence limits are above 0.9, this suggests that these combinations are acceptable as the basis of a typing method. Use of the combination of all six loci (flaA, pilE, asd, mip, mompS, and proA) combined with MAb subgrouping yielded a value of 0.981, with a lower 95% confidence limit of 0.967, indicating that the true D value was unlikely to be below 0.95. As such, the approach meets the criterion specified for an "ideal" typing system (32). In order to demonstrate the validity of this sequence-based typing approach for strains of L. pneumophila other than sg 1, a number of additional isolates were also characterized and gave concordant results; i.e., clinical and environmental isolates that were epidemiologically linked gave identical profiles for all six loci. The authors believe that this study demonstrates the validity of the sequence-based typing approach for the typing of L. pneumophila isolates. However, the inclusion of additional well-characterized isolates belonging to serogroups other than sg 1 in such panels would assist with the determination of confidence limits for non-sg 1 strains.
An expanded online SBT database (version 1.5) can now be queried (see www.ewgli.org). The website also provides detailed instructions regarding the submission of putative novel alleles, i.e., submission of consensus sequences together with sequencing results (chromatogram files) from both forward and reverse reactions, to the database curators. As indicated above, a level of discrimination of about 0.94 was achieved by using only three targets, flaA, pilE, and asd; a level of discrimination of 0.943 was achieved by using all six loci; and a level of discrimination of 0.981 was achieved by using all six loci and MAb subgrouping data. An ideal system for the typing of L. pneumophila should make it possible to determine whether clinical and environmental isolates are related. In an outbreak situation, depending on the numbers of isolates involved, it is still advisable to first confirm the serogroup and perform MAb subgrouping with sg 1 isolates (subject to the availability of MAb panels). When isolates are not yet available, e.g., in the first 1 to 2 days of an outbreak investigation, it has been shown that the direct amplification of SBT primary amplicons from clinical and environmental samples is possible, for example, with flaA and pilE (i.e., the targets with the smallest primary amplicon size [authors' unpublished data]), enabling the rapid generation of sequence data and, in turn, valuable typing data.
The authors propose that epidemiological typing of L. pneumophila isolates be carried out by using sequence-based typing, as described here, with all six loci, at least until a larger data set is established and reviewed. The resulting sequence-based type (or allelic profile) is thus defined by the cumulative allele number description of each locus, in the order flaA, pilE, asd, mip, mompS, and proA. The absence of sequence information for any locus is entered into the profile as a 0.
In conclusion, we have described and evaluated an improved method for the sequence-based epidemiological typing of L. pneumophila. Modifications to the previous online SBT database now allow users to query the database and identify preexisting allelic profiles for the six target genes described here, and the website provides instructions concerning the submission of novel allele types and profiles.

ACKNOWLEDGMENTS
We thank Nita Doshi and John Duncan for expert technical assistance; Jon Green, Anthony Underwood, and William Bellamy for work on the web-enabled database; André Charlett for statistical analysis and advice; and Robert George for constructive comments on the manuscript.
Baharak Afshar and William Bellamy were funded by the European Commission.

ADDENDUM IN PROOF
Analysis of the complete genome sequences of three L. pneumophila strains, Philadelphia-1 T , Paris, and Lens, reveals that both the Paris and Lens strains contain two copies of the mompS gene, whereas the type strain (Philadelphia-1 T ) contains only one. The mompS amplification primers described in our study amplify only a single copy of mompS due to sequence variation in the noncoding flanking regions.