Journal of Clinical Microbiology, October 2004, p. 4566-4576, Vol. 42, No. 10
0095-1137/04/$08.00+0 DOI: 10.1128/JCM.42.10.4566-4576.2004
Pathogen Genomics Group, Institute for Biological Sciences, National Research Council of Canada,1 Bureau of Microbial Hazards, Health Canada,2 Ottawa-Carleton Institute of Biology, Carleton University, Ottawa, Ontario, Canada3
Received 17 March 2004/ Returned for modification 28 April 2004/ Accepted 8 June 2004
The need for alternative subtyping schemes has been recognized, leading to the development of a number of different methods based on differences at the DNA level (i.e., genotyping). The techniques used at present include analysis of polymorphisms in groups of housekeeping genes (multilocus sequence typing [5, 26]), amplified fragment length polymorphism analysis (28), restriction fragment length polymorphism analysis of flaA or rRNA genes (for a review, see reference 32), and pulsed-field gel electrophoresis (PFGE) analysis of macrorestriction patterns (34). Despite the large number of competing approaches, PFGE (8) and, more recently, multilocus sequence typing (26) have emerged as the present "gold standard" genotyping methods, and considerable efforts have been made to standardize protocols in order to facilitate interlaboratory comparisons (32).
One potential weakness shared by these genotyping approaches is that strain relatedness is inferred on the basis of limited subsampling of the entire genome (14). Whole-genome sequencing provides the most complete data set for comparative genomics studies, and genome sequence data are available for more than one strain of an increasing number of bacterial species, including Helicobacter pylori (1) and Escherichia coli (24, 33), among others. The genomic sequence of C. jejuni NCTC 11168 was completed by Parkhill et al. (21), and preliminary genome sequence data for a second C. jejuni strain, RM1221 (18), have recently been made available by The Institute for Genomic Research (TIGR; http://www.tigr.org). Despite these efforts, species with multiple-strain coverage number fewer than 20, and in most cases genome sequence data sets are restricted to two strains. Although sequencing of bacterial genomes has become technically straightforward, it remains expensive and logistically demanding. It is unlikely that whole-genome sequencing can be used for genotyping or large-scale comparative genomics.
Whole-genome sequence data have been used to construct full-genome DNA microarrays that include every open reading frame (ORF) in a genome strain. Microarray-based comparative genomic hybridization (CGH), in which labeled DNAs from two strains are competitively hybridized to a full-genome microarray, has been described in numerous reports (2-4, 6, 9, 11, 20, 27). Microarray-based CGH provides both rich data sets for whole-genome genotyping and an indirect approach for comparative genomics in the absence of whole-genome sequence data. Studies on the genetic diversity and the feasibility of using DNA microarrays as a tool for the genotyping of C. jejuni (7, 14, 22) have demonstrated the value of microarray-based CGH in comparative genomics.
Initial observations of C. jejuni genetic variability by Dorrell et al. (7) were based on a survey of 11 strains. Leonard et al. (14) and Pearson et al. (22) have recently analyzed an additional 16 and 18 strains, respectively. These small-scale studies have revealed extensive genetic variability in C. jejuni, underscoring the need to further characterize intraspecies variability through large-scale surveys involving data sets comprising greater epidemiological, phenotypic, and geographical strain diversities. With the 51 strains analyzed in the present study, the cumulative data on C. jejuni represent the largest and most diverse microarray-based comparative genomics data set to date. We describe here a detailed meta-analysis of all available C. jejuni CGH data. The data have provided us with a comprehensive picture of global C. jejuni gene conservation patterns that suggest low levels of genome plasticity. This analysis has also enabled us to define a highly robust set of variable genes for genotyping of C. jejuni. An increasing body of C. jejuni CGH data will enable us to begin formulating hypotheses about C. jejuni genome evolution and the development of the wide variation in virulence, pathogenicity, and host specificity observed in this economically and medically important human pathogen.
TABLE 1. C. jejuni strains analyzed by CGH in this study
Isolation of genomic DNA. C. jejuni strains were harvested from the growth on plates that had been incubated for 24 h, resuspended in 10 mM Tris-10 mM EDTA (pH 8.0), and treated with lysozyme (Roche, Laval, Quebec, Canada) and RNase A (Qiagen, Mississauga, Ontario, Canada) for 10 min at room temperature. The cell suspensions were then digested with proteinase K (MBI Fermentas, Burlington, Ontario, Canada) for 1 h at 37°C, and complete lysis was obtained by addition of sodium dodecyl sulfate to a final concentration of 0.1% (wt/vol). Genomic DNA was extracted from the cell lysates by three extractions with phenol-chloroform-isoamyl alcohol (25:24:1) and was precipitated in isopropanol.
Genomic DNA labeling. Genomic DNA was restricted to an average size of 2 to 5 kb by double digestion with EcoRI and HindIII. A total of 5 µg of DNA was fluorescently labeled by direct chemical coupling with the Label-IT (Mirus Corp., Madison, Wis.) dyes cyanine 3 (Cy3) and Cy5, as recommended by the manufacturer. Probes were purified from the unincorporated dyes by sequentially passing samples through SigmaSpin (Sigma, Oakville, Ontario, Canada) and Qiaquick (Qiagen) columns. Labeled DNA sample yields and dye incorporation efficiencies were calculated by using an ND-1000 spectrophotometer (Nanodrop, Rockland, Del.).
Microarray hybridizations. The hybridization profile for each strain was obtained by cohybridizing labeled DNA from the test strain and from strain NCTC 11168 (control) to our microarray. By convention, the DNA from test strains was labeled with Cy5 and that from the control strain was labeled with Cy3, although reciprocal labeling was performed with selected strains to test for potential dye incorporation bias. Labeled samples were normalized by selecting test and control sample pairs with similar dye incorporation efficiencies. Equivalent amounts (1 to 2 µg) of labeled test and control samples were pooled, lyophilized, and then resuspended in 35 µl of hybridization buffer (1x DIGEasy hybridization solution [Roche, Laval, Quebec, Canada], 0.5 µg of torula yeast tRNA per µl, 0.5 µg of denatured salmon sperm genomic DNA per µl). The probes were denatured at 65°C for 5 min, cooled to room temperature, and applied to the microarray. Hybridizations were performed overnight at 37°C under glass coverslips (24 by 42 mm) in a high-humidity chamber. Microarrays were washed two times for 10 min each time at 50°C in 2x SSC (1x SSC is 0.15 M NaCl plus 0.015 M sodium citrate)-0.1% sodium dodecyl sulfate, two times for 5 min each time at 50°C in 0.5x SSC, and once for 5 min at 50°C in 0.1x SSC. The slides were spun dry (500 x g, 5 min) and stored in light-tight containers until they were scanned.
Data acquisition and analysis. The microarrays were scanned with a Chipreader laser scanner (Bio-Rad), according to the recommendations of the manufacturer. Spot quantification, signal normalization, and data visualization were performed with the program ArrayPro Analyzer (version 4.5; Media Cybernetics, Silver Spring, Md.). Net signal intensities were obtained by performing local-ring background subtraction, and spots with a signal less than five times greater than the background signal were excluded from the analysis. Signal intensities for triplicate spots were averaged, and the data from each channel were adjusted by subarray normalization by using cross-channel Loess regression. The ratio of the signal for the test strain to that for the control strain for each gene was transformed to its base 2 logarithm (29), log2(tester signal/C. jejuni NCTC 11168 signal), hereafter referred to as the "log ratio," and genes with log ratios less than -0.97 were considered divergent. Technical variations in our methodology were tested for by selecting a subset of strains for replicate hybridizations and treating the data from replicates separately throughout the various analyses. Consistency in the data was assessed by direct comparison of the lists of variable genes obtained from each replicate. In order to examine mapping of variable genes to genomic regions, we organized all CGH data by assuming conservation of gene order (synteny) with C. jejuni NCTC 11168. Genes were assigned to the highly variable (HV) group if they were divergent in more than one strain. On the basis of our unpublished observations of CGH with C. jejuni RM1221, log ratios less than -3.3 are likely to represent genes that are highly divergent (HD) or absent in the tester strain (see Fig. 4A). Genes were assigned to the HD group if the lowest observed log ratio for the gene was less than -3.3 for any of the strains in the data set. All divergent genes that did not meet the HD criterion were assigned to the moderately divergent (MD) group.
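The classification rules described above can be sketched in a few lines of Python. This is a minimal illustration only, not the ArrayPro-based workflow used in the study: the function names, the dictionary layout, and the signal values are hypothetical, and the cutoffs are taken as negative log ratios (-0.97 and -3.3), consistent with the description of HD genes as having highly negative log ratios.

```python
import math

DIVERGENT_CUTOFF = -0.97  # log ratio below this value: gene called divergent
HD_CUTOFF = -3.3          # log ratio below this value: highly divergent or absent


def log_ratio(test_signal, control_signal):
    """Base 2 logarithm of the tester/control signal ratio."""
    return math.log2(test_signal / control_signal)


def classify_genes(log_ratios):
    """Classify genes from a dict of gene -> {strain: log ratio}.

    Returns gene -> {"divergent_strains", "hv", "group"} for divergent genes only.
    """
    calls = {}
    for gene, per_strain in log_ratios.items():
        divergent = sorted(s for s, r in per_strain.items() if r < DIVERGENT_CUTOFF)
        if not divergent:
            continue  # conserved in every strain: not a variable gene
        lowest = min(per_strain.values())
        calls[gene] = {
            "divergent_strains": divergent,
            "hv": len(divergent) > 1,                       # divergent in more than one strain
            "group": "HD" if lowest < HD_CUTOFF else "MD",  # based on lowest log ratio
        }
    return calls


# Hypothetical (tester, control) signal pairs for three genes in two strains.
signals = {
    "geneA": {"s1": (100.0, 1600.0), "s2": (200.0, 800.0)},
    "geneB": {"s1": (500.0, 1000.0), "s2": (1000.0, 1000.0)},
    "geneC": {"s1": (900.0, 1000.0), "s2": (1000.0, 1000.0)},
}
ratios = {
    gene: {strain: log_ratio(t, c) for strain, (t, c) in strains.items()}
    for gene, strains in signals.items()
}
calls = classify_genes(ratios)
```

In this toy input, geneA is divergent in both strains and falls below the HD cutoff in one of them, geneB is divergent in a single strain, and geneC is not divergent at all.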
FIG. 4. Association between high levels of divergence, the occurrence of codivergent clusters, and divergence in multiple strains. (A) Genes which were divergent in experiments with strain RM1221 and NCTC 11168 CGH (log ratio < -0.97) were binned by log ratio value. The BLAST server at TIGR was used to examine the homology of the corresponding NCTC 11168 genes to the RM1221 genome sequence. Gray bars, the number of genes in each bin for which BLAST hits indicated detectable sequence identities; black bars, genes without BLAST hits in the RM1221 genome. On the basis of these data, a log ratio < -3.3 was used as a cutoff to define HD or absent genes from CGH data. (B) Average log ratios [log2(test signal/control signal)] for HD and MD genes. A statistically significant difference in the average log ratios for the two groups can be observed. (C) Percentage of divergent genes that were variable in multiple strains. HD genes are exclusively found to be divergent in multiple strains (100%; 122 of 122). In contrast, MD genes have a similar likelihood of being divergent in a single strain or in multiple strains (45.7 and 54.3%, respectively). (D) Percentage of divergent genes that were adjacent in the C. jejuni NCTC 11168 genome. The majority of HD genes have codivergent neighbors (95.9%; 117 of 122), whereas this value is only 41.3% among MD genes. The figure is based on raw microarray CGH data for data sets II and III.
FIG. 1. Cumulative data from microarray-based CGH surveys of C. jejuni. (A) Divergent genes observed in data sets I, II, and III analyzed in this survey (see text). Of the 542 divergent genes observed in the collective data set, 209 are variable in two or more data sets (gray shading). The remaining 333 genes were variable in only one of the data sets. (B) Prevalence of genes divergent in a single strain (singletons) in three different C. jejuni CGH data sets. Whether analyzed separately or in various combinations (1, data set I; 2, data set II; 3, data set III; 4, data sets I and II; 5, data sets I and III; 6, data sets II and III; 7, data sets I, II, and III), a significant number of divergent singletons (striped bars) were obtained in the various data sets.
Although differences in the methodology used to define divergent genes prevented us from directly incorporating data set IV (22) into the meta-analysis data set, we were able to compare the gene conservation trends obtained in each data set (Fig. 2). Of 266 genes that were variable in data set IV, 78% (209 of 266) had variable counterparts in the meta-analysis data set. Despite the overlap between data sets, of 542 genes that were variable in the meta-analysis data set, 61% (333 of 542) had no variable counterparts in data set IV. Similarly, 57 genes that were variable in data set IV had no variable counterparts in the meta-analysis data set. Of note, more than half of the variable genes unique to data set IV (33 of 57) were divergent singletons. The remaining singletons in data set IV (n = 31) mapped to variable genes previously identified in the meta-analysis data set.
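The overlap counts quoted above can be checked with simple set arithmetic. The sketch below is illustrative only: the integer gene identifiers are placeholders chosen to reproduce the reported set sizes (542 and 266 variable genes, 209 shared), and `overlap_summary` is a hypothetical helper rather than part of the published analysis.

```python
def overlap_summary(set_a, set_b):
    """Summarize the overlap between two sets of variable genes."""
    shared = set_a & set_b
    only_a = set_a - set_b
    only_b = set_b - set_a
    return {
        "shared": len(shared),
        "only_a": len(only_a),
        "only_b": len(only_b),
        "union": len(set_a | set_b),
        "pct_b_shared": 100.0 * len(shared) / len(set_b),
        "pct_a_unique": 100.0 * len(only_a) / len(set_a),
    }


# Placeholder identifiers: 542 meta-analysis genes, 266 data set IV genes,
# arranged so that exactly 209 identifiers are common to both sets.
meta_analysis = set(range(0, 542))
data_set_iv = set(range(333, 599))

summary = overlap_summary(meta_analysis, data_set_iv)
for key, value in summary.items():
    print(key, round(value, 1))
```

With these sizes the two sets share 209 genes (about 78.6% of data set IV), 333 genes are unique to the meta-analysis data set (about 61.4%), 57 are unique to data set IV, and the union contains 599 genes, which matches the total number of divergent genes reported for the four data sets combined.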
FIG. 2. Survey of genetic variability in C. jejuni. Divergent genes were determined for each strain, and the percentage of strains showing divergence at each gene position was calculated and plotted as a histogram according to their position on the genome strain NCTC 11168 (18). The results for data set III (A), data set I (7) (B), data set II (14) (C), and data set IV (22) (D) are shown. Most variable genes map to 16 hypervariable regions in the C. jejuni NCTC 11168 genome (Table 2). These include the lipooligosaccharide biosynthesis locus (L), the flagellar biosynthesis locus (F), the capsular polysaccharide biosynthesis locus (CP), and the restriction-modification locus (RM). Although similar trends were observed for all four data sets, data set III (A) shows better resolution because of the larger sample size.
20% (n = 322) of the genes in strain NCTC 11168, with 80% of genes showing high degrees of conservation among all strains. The investigators acknowledged that this level of conservation was likely to be overestimated because of the small number of strains studied. When we combined our data for 51 strains with data for 46 strains from three previous studies (7, 14, 22), 599 genes, or 36.6% of the 1,634 genes in C. jejuni NCTC 11168, were detectably divergent in at least one strain. The almost 10-fold increase in the number of strains included in this meta-analysis uncovered an additional 277 divergent genes. Even though the microarray data cannot provide evidence on the conservation of gene order, if we assume only localized regions of synteny between test strains and genome strain NCTC 11168, more than half of all divergent genes in our meta-analysis mapped to 16 well-defined genomic regions likely to represent functionally related groups of genes (Fig. 2; Table 2). Most of the variable genes in the meta-analysis data set converged onto variable loci previously defined by Dorrell et al. (7) and Pearson et al. (22); these include the lipooligosaccharide, capsular polysaccharide, and flagellar biosynthetic loci and the restriction-modification locus (Fig. 2). In several cases, variable loci were expanded by the additional data. For example, the variable locus between Cj0295 and Cj0309c (Fig. 2, region 4) was increased by an additional seven genes, on the basis of the CGH meta-analysis data set. Similarly, the restriction-modification locus (Fig. 2, region 14), which spanned from Cj1549c to Cj1556 in the original data set of Dorrell et al. (7), was increased by an additional 10 genes (from Cj1543 to Cj1563c) when the cumulative data were taken into account.
TABLE 2. Hypervariable region endpoints illustrated in Fig. 2a
FIG. 3. Gene conservation levels across different COG groups. Conservation levels were calculated as the percentage of variable genes in each COG group on the basis of the cumulative data from data sets I, II, III, and IV (black bars). The percentage of variable genes belonging to the HV group (striped bars) and the HD group (white bars) was also calculated for each COG. The HV and HD genes are defined in the text. COG X was created by the authors to denote all genes that do not fall under any other defined COG group. HD and HV genes are not mutually exclusive, so their sum can exceed the total number of variable genes. PTM, posttranslational modification.
(i) HD group. Our preliminary observations from CGH of C. jejuni RM1221 against NCTC 11168 showed that all highly negative log ratios (log ratios < -3.3) corresponded to genes which are found in strain NCTC 11168 and which are absent from strain RM1221 (Fig. 4A). A detailed analysis of CGH data from data sets II and III revealed genes in which a minimum log ratio of less than -3.3 was observed in at least one strain. These genes were assigned to the HD group, and all other variable genes were assigned to the MD group. Of the variable genes in data sets II and III, 122 showed highly negative log ratios for at least one strain, providing unambiguous evidence for either high levels of sequence divergence or gene absence. For each variable gene, we calculated the average log ratio for all strains in which the gene was divergent and found that the average log ratio of the HD genes is approximately 1 log2 unit lower than that of the MD genes, and thus, HD genes have a tendency toward highly negative log ratios (Fig. 4B).
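The per-gene averaging described above can be illustrated as follows. This is a hedged sketch with made-up log ratios, not the study data; the helper names are hypothetical, and the cutoffs are taken as negative values (-0.97 for divergence and -3.3 for the HD group), consistent with the reference to highly negative log ratios.

```python
from statistics import mean

DIVERGENT_CUTOFF = -0.97  # gene called divergent in a strain below this log ratio
HD_CUTOFF = -3.3          # gene assigned to the HD group if any strain falls below this


def mean_divergent_log_ratio(ratios):
    """Average log ratio over only those strains in which the gene is divergent."""
    divergent = [r for r in ratios if r < DIVERGENT_CUTOFF]
    return mean(divergent) if divergent else None


def group_means(genes):
    """Average the per-gene values separately for HD and MD genes.

    genes: dict of gene -> list of log ratios, one per strain.
    """
    groups = {"HD": [], "MD": []}
    for ratios in genes.values():
        avg = mean_divergent_log_ratio(ratios)
        if avg is None:
            continue  # gene is not divergent in any strain
        group = "HD" if min(ratios) < HD_CUTOFF else "MD"
        groups[group].append(avg)
    return {name: mean(values) for name, values in groups.items() if values}


# Made-up log ratios for four genes across three strains.
example = {
    "gene1": [-4.0, -2.0, 0.1],
    "gene2": [-1.5, -1.0, 0.0],
    "gene3": [-2.0, 0.2, 0.0],
    "gene4": [0.1, -0.3, 0.0],
}
means = group_means(example)
```

Note that only strains in which a gene is divergent contribute to its average, so conserved strains do not pull the value toward zero.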
(ii) HV group. More than two-thirds (268 of 391) of the variable genes found in data sets II and III were HV, i.e., divergent in multiple strains. Whereas only 54.3% (146 of 269) of the MD genes were HV, every HD gene was also HV (122 of 122) (Fig. 4C).
(iii) Divergent neighbors. Although genomic rearrangements cannot be detected by CGH analysis, there is growing evidence that variable genes in C. jejuni are found in clusters (22). Thus, the presence of genes that have decreased hybridization signals within a single strain and that are adjacent in NCTC 11168 provides stronger evidence of gene divergence or gene absence than does the occurrence of genes without divergent neighbors. Fifty-eight percent of all variable genes that are adjacent to each other in the C. jejuni NCTC 11168 genomic sequence were found to be codivergent in at least one of the strains studied, and these divergent neighbors were often functionally related. Although only 41.3% of the MD genes had divergent neighbors, 95.9% of the genes in the HD group had divergent neighbors within the same strain (Fig. 4D).
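The divergent-neighbor criterion can be sketched as a scan over adjacent gene pairs in the reference gene order. This is an illustrative sketch under stated assumptions: gene order is taken from the reference strain (synteny assumed), the circular genome's wraparound pair is ignored for simplicity, and the function name and gene identifiers are hypothetical.

```python
def genes_with_divergent_neighbors(gene_order, divergent_by_strain):
    """Return genes that are codivergent with an adjacent gene in at least one strain.

    gene_order: list of gene identifiers in reference genome order.
    divergent_by_strain: dict of strain -> set of divergent gene identifiers.
    """
    clustered = set()
    for divergent in divergent_by_strain.values():
        # Compare each gene with its immediate downstream neighbor.
        for left, right in zip(gene_order, gene_order[1:]):
            if left in divergent and right in divergent:
                clustered.update((left, right))
    return clustered


# Hypothetical five-gene region and three strains.
order = ["g1", "g2", "g3", "g4", "g5"]
divergent_calls = {
    "strainX": {"g1", "g2", "g4"},  # g1-g2 are codivergent; g4 is isolated here
    "strainY": {"g4"},              # isolated divergent gene
    "strainZ": {"g4", "g5"},        # g4-g5 are codivergent
}
clustered = genes_with_divergent_neighbors(order, divergent_calls)
```

Both genes of a codivergent pair must be divergent within the same strain; divergence of neighboring genes in different strains does not count.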
The results summarized in Fig. 5 show that of the 122 HD genes, 117 (95.9%) fulfilled the two additional criteria, i.e., HV and divergent neighbors (Fig. 5E), whereas this value was only 34.6% (93 of 269) among the MD genes (Fig. 5A). In addition, whereas the MD genes tended to vary in small numbers of strains (n = 269, mean = 4.0, standard deviation = 6.3), HD genes tended to be variable in larger numbers of strains (n = 122, mean = 26.5, standard deviation = 15.9) (Fig. 6). As genes with detectable divergence across various data sets are likely to represent a useful set of typing markers, it is significant that 57% (70 of 122) of the HD genes were also found to be HV in all four C. jejuni data sets included in the meta-analysis. Thus, it appears that our selection criteria independently converge on the HV and HD genes common to all four data sets.
FIG. 5. Summary of HD genes obtained from microarray-based CGH of C. jejuni. Of 391 variable genes in data sets II and III, 122 genes were HD in one or more strains (E to H). A very high percentage (95.9%; 117 of 122) of all HD genes have variable neighbors within a single strain and are divergent in multiple strains (HV) (E). The results are based on raw microarray CGH data from data sets II and III.
FIG. 6. Relationship between gene divergence and intraspecies variability. The distributions of MD and HD genes (white squares and black circles, respectively) show little overlap. MD genes tended to be variable in small numbers of strains (n = 269, mean = 4.0, standard deviation = 6.3), whereas HD genes displayed high degrees of intraspecies variability (n = 122, mean = 26.5, standard deviation = 15.9).
One major challenge of microarray-based CGH remains data interpretation. Two biological processes, gene divergence and gene loss, are inferred solely on the basis of differential hybridization signals that are largely the result of a spectrum of degrees of gene divergence that can culminate in gene loss. Kim et al. (13) have argued that the use of arbitrary thresholds tends to underestimate the true number of outliers (divergent genes) in microarray-based CGH analysis and have devised a method that increases the sensitivity of outlier detection by dynamically computing a threshold based on the distribution of log ratio values for each hybridization experiment. For the purposes of exploratory comparative genomics, the increased sensitivity of this method is clearly superior to that from the use of a static threshold, but one drawback of this approach is that narrow log ratio distributions can produce thresholds with decreased stringencies that may overestimate the numbers of outliers. In addition, the use of the log ratio distribution ignores the effects of signal intensity and dynamic range on outlier detection, especially as low-intensity data are inherently less reliable because of low signal-to-noise ratios. For strain classification and genotyping, it is crucial that only unambiguous gene divergence data be used for analysis, and new methods which incorporate dynamic threshold determination but which apply intensity-dependent corrections to the threshold will need to be developed. In their absence, we have chosen to use a conservative linear threshold, a log ratio of -0.97, to assign gene divergence. Our decision to select this value was based on a high-resolution exploration of the range of thresholds from -0.75 to -3.0 (results not shown). Whereas increasing the stringency of the threshold to values below -0.97 led to a modest but steady drop in the number of genes assigned as divergent, even small decreases in the stringency of the threshold to values above -0.97 led to large increases in the number of potential false-positive results. It would therefore appear that thresholds at or near this value represent a good compromise between maximizing the sensitivity of divergent gene detection and minimizing the number of false-positive results.
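A threshold exploration of this kind can be sketched as a simple sweep that counts divergent calls at each candidate cutoff. The sketch below is an assumption-laden illustration, not the authors' procedure (whose results were not shown): the step size of 0.25, the function names, and the log ratio values are all hypothetical, and the thresholds are taken as negative log ratios.

```python
def count_divergent(log_ratios, threshold):
    """Number of genes whose log ratio falls below the threshold."""
    return sum(1 for r in log_ratios if r < threshold)


def sweep_thresholds(log_ratios, start=-0.75, stop=-3.0, step=-0.25):
    """Count divergent genes at each threshold from start to stop, inclusive."""
    n_steps = int(round((stop - start) / step))
    results = []
    for i in range(n_steps + 1):
        threshold = round(start + i * step, 2)
        results.append((threshold, count_divergent(log_ratios, threshold)))
    return results


# Made-up log ratios for a single hybridization.
example_ratios = [0.1, -0.2, -0.8, -0.9, -1.2, -2.0, -3.5, -4.1]
sweep = sweep_thresholds(example_ratios)
for threshold, n_divergent in sweep:
    print(f"{threshold:6.2f}  {n_divergent}")
```

Because more stringent (more negative) thresholds can only exclude genes, the counts are non-increasing along the sweep; the shape of that curve on real data is what would indicate where false-positive calls begin to accumulate.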
Previous microarray-based CGH studies with C. jejuni and other species have grouped divergent and deleted genes into a single category because present analytical tools are unable to make the distinction between the two (7, 13). On the basis of empirical CGH work with strain RM1221 (Fig. 4A), we have determined that moderately negative log ratios are more likely to represent gene divergence events, whereas genes with highly negative log ratios are more likely to represent gene absence events. A detailed examination of the microarray CGH data allowed us to make the distinction between genes that exhibit high levels of divergence and genes that exhibit moderate levels of divergence on the basis of differences in the amplitudes of their respective log ratio values. Of the 599 genes observed to be divergent in the four data sets, 122 qualified as HD on the basis of the highly negative log ratios observed in at least one of the strains in data sets II and III.
In order to identify a set of divergent genes that would be most useful for genotyping, we coupled the data for genes with high intraspecies variability, as detected by CGH, with additional lines of evidence of potential biological importance: high degrees of divergence in one or more strains and the occurrence of adjacent divergent genes within a single strain. Although the distinction between highly and moderately divergent genes based on highly negative log ratios requires further refinement, it is worth noting that 96% (117 of 122) (Fig. 5E) of the HD genes that we identified have divergent neighbors and are divergent in multiple strains. More significantly, as these genes tend to provide unambiguous microarray results and tend to have high intraspecies variabilities, they represent an excellent set of polymorphic markers that could form the basis for a highly discriminatory genotyping method. Of the 84 genes that were HV in each of the four data sets analyzed, 70 were also HD (see http://ibs-isb.nrc-cnrc.gc.ca/ibs/immunochemistry/suppInfo_Taboada_2004a_e.html). Well-known polymorphic regions (e.g., lipooligosaccharide biosynthesis, flagellar biosynthesis, and capsular polysaccharide biosynthesis) are a significant source of HV and HD genes. These loci (Fig. 2, regions 11, 12, and 13, respectively) contain 7, 15, and 14 HV and HD genes, respectively. Another region contributing a large number of HV and HD genes (n = 10) is the region from Cj0480c to Cj0490 (Fig. 2, region 6), which Pearson et al. (22) have termed plasticity region 2. This region contains truncated genes for altronate hydrolase and aldehyde dehydrogenase, a putative sugar transporter and a putative oxidoreductase, respectively, among others. Hypervariable region 3 (Fig. 2, region 3), which contains several iron uptake transporters and a putative membrane siderophore, contains three contiguous HV and HD genes (Cj0178-Cj0179-Cj0180). While many HV and HD genes have known functions, a large number of HV and HD genes (16 of 70; 23%) represent putative or hypothetical genes of unknown function. While the preponderance of well-established polymorphic genes among the list of 70 HV and HD genes validates the results of our CGH meta-analysis, our results also show that a significant number of HV and HD genes represent novel polymorphic typing targets.
The main advantage of the meta-analysis approach described here is that increased sample sizes tend to comprise a greater degree of genetic diversity, reducing the effect of sampling biases. As the manuscript was being finalized, data set IV (22) became publicly available, which enabled us to perform a comparison of the global trends obtained from large-scale sampling using our original meta-analysis data set (data sets I, II, and III) and those obtained from the 18 strains evaluated in that study. There was significant overlap in the variable genes obtained from both data sets, as 78% of the variable genes (209 of 266) in data set IV had variable counterparts in the meta-analysis data set. However, despite similar gene conservation profiles, data set IV contained 57 variable genes that had shown no variability among the 79 strains in the original meta-analysis data set. Thus, despite the large sample size used in this study, we recognize that a more comprehensive study of C. jejuni comparative genomics will require targeted sampling of strains expected to be genetically diverse on the basis of a number of different epidemiological or phenotypic parameters.
Using a meta-analysis approach, we have been able to identify genes that have high degrees of intraspecies variability in C. jejuni and that can be targeted for genotyping purposes. Since most present molecular typing methodologies are based on DNA polymorphisms with poorly defined biological significance, a new generation of genotyping methods that incorporate data from microarray-based CGH will have the advantage of being founded on tracking of the conservation of genes of interest at the whole-genome level. This fundamental difference represents a major leap toward the rational design of molecular typing methods that couple biologically relevant information, namely, the gene conservation profiles that ultimately govern a strain's phenotype, to epidemiological surveillance.
Funding for this work has been provided to E.N.T., R.R.A., C.D.C., W.A.F., O.L.M., M.J.R., and J.H.E.N. through the National Research Council Genomics and Health Initiative and to D.T.M. and J.M.F. through Health Canada and the University of Ottawa. Preliminary sequence data were obtained from the Institute for Genomic Research at http://www.tigr.org.
Copyright © 2009 by the American Society for Microbiology.