Core Genome Multilocus Sequence Typing and Single Nucleotide Polymorphism Analysis in the Epidemiology of Brucella melitensis Infections

The use of whole-genome sequencing (WGS) using next-generation sequencing (NGS) technology has become a widely accepted method for microbiology laboratories in the application of molecular typing for outbreak tracing and genomic epidemiology. Several studies demonstrated the usefulness of WGS data analysis through single-nucleotide polymorphism (SNP) calling from a reference sequence analysis for Brucella melitensis, whereas gene-by-gene comparison through core-genome multilocus sequence typing (cgMLST) has not been explored so far.

disease, however, still persists in several countries in the Mediterranean area. In Italy, despite implementation of the brucellosis eradication program for over 50 years, ovine and caprine brucellosis remains endemic in several southern provinces, in Sicily in particular (6). To date, the regions of Italy still not classified as OBF cover approximately 35.5% of the national land surface, where 39.9% of all small ruminants are farmed (7,8). The current brucellosis surveillance system in Italy involves regular serological testing and slaughtering of the positive animals from which a bacteriological isolation is performed for confirmation of the diagnosis. Control testing is performed less frequently in the OBF regions, where the goal is to control reintroductions of the disease, whereas it is continuous in the affected areas, where the main aim is eradication of brucellosis.
Efficient and reliable surveillance programs are essential for detection and control of outbreaks and largely depend on collection and access to epidemiological data. Currently, epidemiological investigations rely on the availability of standardized and effective molecular typing methods and analysis tools that allow the public health laboratories to identify and trace an outbreak back to its source.
Identification and typing of B. melitensis are still traditionally performed with the use of biotyping techniques. This methodology, however, suffers from inconsistencies and requires handling of the live bacteria. For this reason, PCR-based typing is now commonly used as an alternative to the culture-dependent typing methods (9)(10)(11)(12). The results of the classical biotyping schemes categorize B. melitensis into three biovars that are of limited epidemiological value, as they do not provide sufficient resolution between the isolates. Moreover, an individual biotype often predominates in particular areas, as seen in Italy, where biovar 3 is almost exclusively isolated from the local animal populations (13). B. melitensis is a highly clonal, i.e., monomorphic pathogen, which renders its differentiation at the strain level very difficult (14). Pattern-based techniques such as pulsed field gel electrophoresis and amplified fragment length polymorphism have been applied in the past, but these techniques were not able to differentiate Brucella at the subspecies level, which correlated with low intra-and interlaboratory reproducibility (15). In recent years, the typing methods have shifted toward genomebased approaches that finally allowed an accurate differentiation between Brucella isolates and establishment of a common consensus for the subtyping schemes of this pathogen (6,(16)(17)(18).
To date, multilocus variable number of tandem repeats analysis (MLVA) has been considered the most efficient typing method for Brucella spp. Several studies demonstrated that MLVA has a high discriminating resolution, in congruence with MLST, and is sufficient for in-depth study of either genome evolution or outbreak epidemiology (19). According to MLVA schemes, the B. melitensis population can be divided into West Mediterranean, East Mediterranean, and American lineages (20,21). Moreover, with the development of an international repository, the MLVA data can be stored on web servers and shared between research institutes, thereby increasing MLVA utility as a tool used for analysis of Brucella epidemiology in the world (http://microbesgenotyping .i2bc.paris-saclay.fr/databases/view/907) (22). However, this typing method has several weaknesses, related both to the nature of variable-number tandem repeats (VNTRs) as well as to laboratory demands of the technique itself (12).
With advances in and decreased cost of whole-genome sequencing (WGS), new methods of pathogen typing, including gene-by-gene comparison using core genome multilocus sequence typing (cgMLST), as well as single-nucleotide polymorphism (SNP) calling based on a reference sequence analysis, are considered to be a suitable and more informative replacement of the gold standard typing schemes (23)(24)(25)(26). cgMLST is performed by assigning specific alleles to a predefined set of core genes, i.e., genes present in all strains of a given bacterial species. Validated schemes for several pathogens are publicly available and can be shared to ensure reproducibility and comparability of the results across laboratories (23).
The aims of our study were to develop a cgMLST scheme for B. melitensis and to assess the performance of cgMLST and a whole-genome SNP-based approach against     coverage ranged from 18ϫ to 356ϫ, with an average of 155ϫ. All scaffolds were assembled with SPAdes version 3.11.1 with the -careful option selected (28,29). cgMLST target definition. To determine the cgMLST gene set, we performed a genome-wide gene-by-gene comparison using the cgMLST Target Definer (version 1.4) function of the SeqSphereϩ software, v5.0.90 (Ridom GmbH, Münster, Germany), with default parameters. These parameters comprised the following filters to exclude certain genes of the B. melitensis bv. 1 strain 16M reference genome (NC_003317.1 and NC_003318.1) from the cgMLST scheme: a minimum length filter that discards all genes shorter than 50 bp, a start codon filter that discards all genes that contain no start codon at the beginning of the gene, a stop codon filter that discards all genes that contain no stop codon or more than one stop codon or if the stop codon is not at the end of the gene, a homologous gene filter that discards all genes with fragments that occur in multiple copies within a genome (with identity of 90% and more than 100-bp overlap), and a gene overlap filter that discards the shorter gene from the cgMLST scheme if the two genes affected overlap by Ͼ4 bp. The remaining genes were then used in a pairwise comparison using BLAST, version 2.  (30). Using all genes of the reference genome that were common in all query genomes, with a sequence identity of Ն90% and 100% overlap and with the genome filters start codon filter, stop codon filter, and stop codon percentage filter turned on, the final cgMLST scheme was formed. Therefore, all genes having no start or stop codon in one of the query genomes, as well as genes that had internal stop codons in more than 20% of the query genomes, were discarded. SNP analysis. SNPs were identified using In Silico Genotyper (ISG), version 0.16.10-3 (31). We used default filters to remove SNPs from duplicated regions, minimum quality was set to Phred 30, and the minimum allele frequency was set to 90% in all samples. We used the ISG pipeline with BWA-MEM (version 0.712-r1039) (32) as the aligner and GATK (version 3.9) (33) as the SNP caller. The SNPs were called based on alignment to the reference Brucella melitensis bv. 1 strain 16M (GenBank accession numbers NC_003317.1 and NC_003318.1). Clean unique variants used in further analysis are listed in Tables S1 and S2 in the supplemental material.
Clustering analyses and cluster definition. MLVA-16 allelic profiles and SNP matrix data were analyzed using the goeBURST algorithm implemented in PHYLOViZ, version 2.0 (34). Minimum spanning trees (MST) were created using default software settings. The cgMLST profiles were assigned using B. melitensis task template in Ridom SeqSphereϩ (35,36). MSTs were created by pairwise comparison of cgMLST target genes. Missing values were ignored in the calculation of distance between pairs of sample profiles. The links between the MST nodes represented the distance between the genotypes. The cluster cutoff value was defined as the maximum pairwise distance found between epidemiologically linked isolates. The maps in Fig. 1A and B were drawn with SeqSphereϩ by using GeoNames (http://www .geonames.org) for geocoding and Natural Earth (http://www.naturalearthdata.com) for drawing vector maps.

RESULTS
Epidemiologically linked B. melitensis isolates. The outbreak-related isolates were detected in 21 different farms in three Italian provinces over a period of 1.5 years. The culture-positive samples belonged to 37 animals that were investigated as a part of the within-and among-farm epidemiological investigation ( Fig. 1 and Table 1).
MLVA-16 revealed the presence of 13 different genotypes, divided into two groups formed by single-locus variants and one double-locus variant ( Fig. 2A). MST showed that the groups were split by mutations in the three hypervariable loci bruce04, bruce09, and bruce16. One group included three genotypes of four isolates collected from farms located in the province of Rome, whereas in the remaining 33 strains from Isernia, Campobasso, and Frosinone provinces we identified 10 distinct genotypes.
We generated a cgMLST scheme comprised of 2,704 targets based on the B. melitensis 16M reference genome. The cgMLST clustering divided the isolates into two different genetic complexes, grouping the two farms from the province of Rome (complex 2) separately from the remaining 19 farms (complex 1). The genetic division measured with the cgMLST panel was for 164 different genes (Fig. 2B). The analysis using the B. melitensis panel found one prevalent genotype that was similar across the provinces of Frosinone, Campobasso, and Isernia and was found in 10 of the tested farms.
Sixteen isolates in complex 1 shared identical core genome profiles, and the largest distance between any two neighboring isolates was not greater than three genes. In complex 2, one isolate was separated from the other three by one gene difference.
Removing 50 targets from the analysis where any value was missing decreased the distances between the nodes even further and classified all samples from Rome as identical (not shown). A within-farm genetic variation was also observed.
The SNP analysis identified 3,390 SNPs, of which 3,146 were classified as clean unique variants and included in further analysis. The tree split the samples into two genetic clusters with a distance of 244 SNPs between them (Fig. 2C). We observed a within-farm variation of 2 MLVA-16 loci, 3 cgMLST loci, and 4 SNPs. The maximum pairwise distance found in the two complexes was 6 cgMLST genes and 7 SNPs.
The comparison of discriminatory power of MLVA, cgMLST, and SNP typing showed that the SNP-based approach was superior to the other two methods, with an SDI of 0.922 and 95% confidence intervals (CI) of 0.866 to 0.978. SDI of cgMLST was calculated to be 0.815 (95% CI, 0.685 to 0.945), and SDI of MVLA-16 was 0.674 (95% CI, 0.505 to 0.843). SNP typing was a good predictor of cgMLST, with an AW of 0.788 (95% CI, 0.546 to 1.000). The correspondence of the typing results, however, was not bidirectional, as the cgMLST to SNP AW was 0.295 (95% CI, 0.136 to 0.453). Comparison of the remaining pairs of typing schemes showed that there was no congruence between clusters they predicted (the AW of each pair did not exceed 0.03).
B. melitensis isolates with unknown epidemiological status. MST calculated using the MLVA-16 typing results showed a distance between directly linked nodes not exceeding 9 VNTR loci (Fig. 3). Fifty-one MLVA-16 profiles were assigned to the 71 strains, and diverse allele variants were identified in all loci apart from bruce45. Eleven profiles were shared by more than one isolate, which, with the exception of one human isolate, corresponded to the samples originating from the same geographical location (Table 1).
MLVA profiles tend to be conserved between epidemiologically linked strains; therefore, the strains from an outbreak are likely to have a similar MLVA profile. Three The distance labels correspond to the number of discriminating SNPs between neighboring genotypes. The prefix ItBM was omitted from the isolates' labels for simplicity.
MLVA-16 profiles, 10 (samples ItBM_41 to ItBM_44), 15 (samples ItBM_93 to ItBM_96 and ItBM_98), and 24 (ItBM_55 and ItBM_89 to ItBM_91), were identified in more than three strains, suggesting close relatedness of samples within these profiles. The method also allowed identification of two clear outliers. Samples ItBM_38 and ItBM_39 showed a distance of 9 alleles from the nearest B. melitensis isolate and no relatedness to one another.
According to our MLVA-16 data, only three out of six human cases could be linked to a specific animal source analyzed in our study. Human samples ItBM_41 and ItBM_43, isolated from two patients in the city of Salerno, shared the same MLVA-16 profile as two animal isolates from a farm in Salerno province (samples ItBM_42 and ItBM_44), all collected in 2011. Human isolate ItBM_50 and two animal isolates (ItBM_51 and ItBM_52) were assigned MLVA-16 profile 42, but interestingly, ItBM_50 was isolated 4 years later than the animal strains. The other three human samples did not show sufficient relatedness to any of the animal isolates to reliably trace the source of infection. The number of variable loci, in these cases, ranged from 2 to 9 in relation to the closest neighboring MLVA-16 profile.
To increase the discriminatory power of the investigation, we analyzed 71 assemblies using a cgMLST scheme. The genome assemblies exceeded 98% of good targets (with a mean of 99.4%). Isolates ItBM_38 and ItBM_39 were clear outliers, separated from the closest neighbor by 1,227 genes, and 1,096 loci from one another (Fig. 4A).
Based on the analysis of epidemiologically related isolates, we used 6-gene difference as a threshold for a potential complex of related cases. Thirteen complexes were The tree was generated using the goeBURST algorithm in PHYLOViZ software. The distance labels correspond to the number of discriminating alleles. The red nodes correspond to human isolates and the blue nodes to animal isolates. The prefix ItBM was omitted from the isolates' labels for simplicity. assigned in the MST data analysis. Gene-by-gene analysis confirmed relatedness of genotypes with MLVA-16 profiles 10 and 15; however, according to cgMLST two other isolates were at a distance of 0 to 1 gene away from the samples of MLVA-16 profile 15, as was one other isolate of profile 10. ItBM_55, classified as MLVA-16 profile 24, was shown not to be closely linked to other isolates with the same MLVA-16 alleles when examined with a gene-by-gene approach.
Using cgMLST, four of the human isolates (ItBM_41, ItBM_43, ItBM_50, and ItBM_108) were found in the distance not exceeding 2 alleles to the closest animal strain. Two of the human samples originating in Piedmont (ItBM_99 and ItBM_78) were genetically different from the animal samples, with 156 and 195 allele differences from the closest isolate, and could be identified as outliers, although they were distantly related to other Italian genotypes. Divergence of these two samples was not evident in MLVA-16 typing (distance of 2 to 3 alleles to other isolates).
A total of 6,540 SNPs were discovered by mapping 71 genomes to the B. melitensis 16M reference strain. Out of these, 6,027 were considered high-quality discriminatory SNPs and were used to infer the relationship between the strains. We applied the threshold of 7 SNPs to detect the clusters of closely related cases, and in accordance with cgMLST analysis, we identified 13 complexes (Fig. 4B). The highest distances observed between two adjoining isolates were 2,616 and 2,235, belonging to the SNP profiles of ItBM_38 and ItBM_39, which also were marked as outliers by MLVA-16 and cgMLST analyses.
In agreement with cgMLST, two human cases (ItBM_78 and ItBM_99) could not be traced to any of the analyzed animal strains of B. melitensis, and both differed by more than 200 SNPs from the nearest SNP profile. Close genetic relationship to at least one isolate from an animal host was confirmed for ItBM_41, ItBM_43, ItBM_108, and ItBM_50.

DISCUSSION
Our study compared the performance of two WGS-based typing methods, SNP analysis and cgMLST, with the gold standard MLVA-16 in an analysis of the phylogenetic relationship between isolates of B. melitensis detected in the context of a national surveillance program.
We found that all three typing schemes generally performed equally, and although SNP analysis had the highest resolving power in terms of differences detected between the isolates, the number of predicted genotypes in the surveillance scenario was comparable for all examined methods (51 MLVA-16 types, 55 cgMLST types, and 60 SNP types), and the SDI were similar. However, SDI test applied to samples from epidemiologically linked sets showed that SNP analysis was superior in differentiating between closely related samples. This suggests that while WGS-based approaches could be used as standalone tools in establishing phylogenetic relationships, MLVA-16 optimally should be supported by either SNP or gene-by-gene typing results.
In our study, all three typing methods accurately predicted the presence of two genomes divergent from the rest of the Italian strains. Indeed, the majority of analyzed samples belonged to the West Mediterranean lineage of B. melitensis, while the outliers were members of the East Mediterranean and American lineages (6). Epidemiological investigation showed that ItBM_38 was isolated from a Syrian patient with a history of frequent travel to his home country, where the same East Mediterranean lineage is thought to be prevalent (39). The strain ItBM_39, on the other hand, was isolated from a goat imported to Italy from Spain. Two human isolates, ItBM_50 and ItBM_108, were found in the same SNP and cgMLST complexes as animal strains, but interestingly, the samples were collected a few years apart and in different geographical locations, suggesting that animal isolates could have been closely related (or ancestral) to the source of human infection but not directly involved in the transmission event. In these cases, observation based on WGS typing indicates that strains of B. melitensis were circulating in the affected regions of Italy for many years and the surveillance program failed to eradicate them.
For distantly related genomes from the same lineage, cgMLST as well as SNP analysis provided higher phylogenetic distance resolution than MLVA-16, and therefore spotting divergent genotypes unlikely to be connected to the other circulating strains was possible with greater confidence. This was particularly apparent in the case of two clinical isolates (ItBM_99 and ItBM_78) and in the case of B. melitensis collected from an ibex (Capra ibex ibex) in Gran Paradiso National Park, located in the Graian Alps in Italy (sample ItBM_100). This demonstrated that while all applied schemes could be used to identify very distant genomic outliers within the Brucella population, WGS-based schemes were superior in identifying unrelated cases belonging to the same lineage. Additionally, within the clusters of similar genotypes, cgMLST performed equally to the SNP analysis, but some discrepancies were observed in MLVA-16 analysis. For instance, seven isolates from Sicily had profiles differing by a maximum of two SNPs or one gene (samples ItBM_92-ItBM_98), suggesting that they were very closely related. However, while five of these isolates shared MLVA-16 profile 15, one belonged to type 8 (1 allele distant; bruce19) and another to type 12 (2 alleles distant; bruce4 and bruce7). The interpretation of WGS results therefore suggests that these were actually strains from the same complex, while MLVA-16 typing would not necessarily lead to the same conclusion. A similar observation was reported by Dallman et al. (40), who showed that using SNP analysis of E. coli O157 isolates identified linked cases with twice the sensitivity of the MLVA-16 scheme, while Georgi and colleagues (39) demonstrated that MLVA-16 had lower discriminatory power than the WGS-based SNP typing by analyzing a set of 63 human B. melitensis isolates. Interestingly, in our cluster of outbreak-related cases, we identified several genotypes that differed by one, two, or three hypervariable alleles and belonged to an outbreak caused by a single epidemic clone. WGS-based analysis of these strains showed that they were very closely related (up to 6 genes or 7 SNPs of difference). Together, these observations show that MLVA-16 profiling might not provide enough resolution to accurately predict phylogenetic relationships between isolates involved in an ongoing outbreak or strains that have been circulating over the years with no direct link to one another. SNP analysis has successfully been used to discriminate between Brucella species and to map the geographic distribution and global spread of B. melitensis (18,39,41). However, to date there is no official and validated cgMLST scheme for any of the Brucella species. Consequently, the cluster types for specific data and particularly for closely related strains can only be assessed empirically and therefore are subject to variation between laboratories. In order to reliably interpret the results, cutoff values first should be established based on the analysis of a significant number of closely related strains and unrelated strains sharing common or closely related profiles assigned using gold standard typing methods. The analysis of outbreak-related isolates suggested that two independent epidemic clones were circulating in central Italy at the same time. The maximum pairwise distance between isolates within complexes formed by these clones did not exceed 6 genes (cgMLST) or 7 SNPs. These findings highlight the potential criteria necessary for inclusion of an isolate into a brucellosis outbreak cluster that we would therefore suggest to be Յ6 loci in the cgMLST and Յ7 in WGS SNPs analysis.
Jackson et al. argued that a general cutoff value applied in SNPs or cgMLST could not always reliably predict whether samples were epidemiologically related and that isolates with SNP differences ranging from 10 to 30 were frequently linked (42). Thus, we believe that the proposed cutoff values should be taken as a guideline and interpreted in the context of available epidemiological information.
Using a typing approach that offers maximum resolution is particularly important for tracing the spread of a disease during an outbreak. SNP analysis potentially has the highest discriminatory power among the typing methods, as polymorphisms can be discovered in both coding and noncoding regions of the genome. However, the choice of a reference genome can significantly influence the number of identified SNPs and the accuracy of the reconstructed phylogenetic relationships (43). cgMLST relies on the availability of complete, accurately sequenced genomes for the generation of the typing schemes. Inclusion of coding sequences only decreases the number of sites typed in the analysis, but at the same time it facilitates standardization and reproducibility of the analyses as it focuses on a predefined set of genes. In WGS analysis the quality of the reads as well as of the assembly plays a crucial role in achieving reliable cgMLST results. While in our study all samples reached at least 98% of good targets, low-quality assemblies are likely to have a reduced number of good targets and therefore lead to generation of inaccurate results in phylogenetic analysis. We therefore propose that the data with good targets of less than 97% should be taken with caution.
In conclusion, WGS/NGS data can be used effectively to gain a better understanding of epidemiology and dynamics of Brucella populations and to gather in-depth information which can be used for source tracing in case of outbreaks within animal holdings, zoonotic or foodborne infections, and illegal animal movements. Moreover, WGS data facilitate the assessment of the possible extent of an ongoing outbreak and the reliable prediction of the routes of its spread.
In accordance with the One Health approach, public health agencies can implement WGS to aid in disease control and eradication plans. In our study, both cgMLST and SNP analysis performed well despite the restricted level of B. melitensis genetic diversity, and we demonstrated that the performance of the gene-by-gene approach was comparable to that of the SNP analysis. On the basis of these results, we believe that MLVA-16 typing of B. melitensis in Italy can now be successfully replaced by the more informative WGS analysis.

SUPPLEMENTAL MATERIAL
Supplemental material for this article may be found at https://doi.org/10.1128/JCM .00517-18. D.H. is one of the developers of the Ridom SeqSphereϩ software mentioned in the article, which is a development of the company Ridom GmbH (Muenster, Germany) that is partially owned by him. The other authors have declared no conflicts of interest.