Direct Detection of Shigella in Stool Specimens by Use of a Metagenomic Approach

ABSTRACT The underestimation of Shigella species as a cause of childhood diarrhea disease has become increasingly apparent with quantitative PCR (qPCR)-based diagnostic methods versus culture. We sought to confirm qPCR-based detection of Shigella via a metagenomics approach. Three groups of samples were selected from diarrheal cases from the Global Enteric Multicenter Study: nine Shigella culture-positive and qPCR-positive (culture+ qPCR+) samples, nine culture-negative but qPCR-positive (culture− qPCR+) samples, and nine culture-negative and qPCR-negative (culture− qPCR−) samples. Fecal DNA was sequenced using paired-end Illumina HiSeq, whereby 3.26 × 108 ± 5.6 × 107 high-quality reads were generated for each sample. We used Kraken software to compare the read counts specific to “Shigella” among the three groups. The proportions of Shigella-specific nonhuman sequence reads between culture+ qPCR+ (0.65 ± 0.42%) and culture− qPCR+ (0.55 ± 0.31%) samples were similar (Mann-Whitney U test, P = 0.627) and distinct from the culture− qPCR− group (0.17 ± 0.15%, P < 0.05). The read counts of sequences previously targeted by Shigella/enteroinvasive Escherichia coli (EIEC) qPCR assays, namely, ipaH, virA, virG, ial, ShET2, and ipaH3, were also similar between the culture+ qPCR+ and culture− qPCR+ groups and distinct from the culture− qPCR− groups (P < 0.001). Kraken performed well versus other methods: its precision and recall of Shigella were excellent at the genus level but variable at the species level. In summary, metagenomic sequencing indicates that Shigella/EIEC qPCR-positive samples are similar to those of Shigella culture-positive samples in Shigella sequence composition, thus supporting qPCR as an accurate method for detecting Shigella.

enomic reads were then aligned to these 1-kb virulence genes using bowtie2 using the paired-end end-to-end read mode assignment; i.e., a read is assigned to an amplicon only if both pairs align entirely on the same amplicon.
Statistics. We used Mann-Whitney U test to examine whether the read counts of Shigella species, human sequences, or virulence factors in the culture Ϫ qPCR ϩ samples were different from the culture ϩ qPCR ϩ or culture Ϫ qPCR Ϫ samples. Correlation of read counts between various virulence factors was tested by regression analysis using analysis of variance (ANOVA). Two-tailed P values were calculated, and values of Ͻ0.05 were considered statistically significant. All analyses were performed using IBM SPSS version 24.

RESULTS
Direct detection of Shigella in stool samples by metagenomics sequencing. The 27 selected diarrheal samples were previously tested with a broad range of pathogenspecific qPCRs utilizing TaqMan Array Cards as described previously (11). Shigella/EIEC was the primary diarrhea-associated pathogen (i.e., with the highest odds ratio for diarrhea) in both the culture ϩ qPCR ϩ and the culture Ϫ qPCR ϩ groups (average quantification cycles of 18.7 Ϯ 1.03 versus 18.7 Ϯ 1.05; P ϭ 0.594). Additional diarrheaassociated pathogens were identified in specimens, as indicated in Table S3 in the supplemental material, and included rotavirus, Cryptosporidium spp., astrovirus, Helicobacter pylori, and adenovirus 40/41. The 27 metagenomic libraries had an average of 300M high-quality paired-end reads. There was no difference in total reads among the three sample groups (295.4 Ϯ 42.8 ϫ 10 6 for Shigella culture ϩ qPCR ϩ samples, 295.4 Ϯ 47.7 ϫ 10 6 for Shigella culture Ϫ qPCR ϩ samples, and 329.5 Ϯ 59.2 ϫ 10 6 for Shigella culture Ϫ qPCR Ϫ samples; P Ͼ 0.05). However, nearly half of the reads (41.8 Ϯ 39.5%) were human sequences and varied from 0.3 to 97.8% of the reads in individual libraries. Shigella culture Ϫ qPCR Ϫ samples had a much lower composition of human sequences (13.2 Ϯ 20.0%, P Ͻ 0.05) than the other two groups of samples (61.7 Ϯ 43.1% for Shigella culture ϩ qPCR ϩ samples and 50.3 Ϯ 37.0% for Shigella culture Ϫ qPCR ϩ samples; P ϭ not significant). Shigella qPCR ϩ samples had high read counts of human sequences (82.4% for the 4 dysenteric samples and 58.5% for the 14 nondysenteric samples; P ϭ 0.142). To adjust for potential sampling bias introduced by the variable human sequence composition, the read counts were compared as the proportion of nonhuman sequence reads (Fig.  1). Overall, 0.65 Ϯ 0.42% of nonhuman sequence reads were assigned by Kraken as Shigella species for culture ϩ qPCR ϩ samples and 0.55 Ϯ 0.31% for culture Ϫ qPCR ϩ samples (P Ͼ 0.05). Both were significantly higher than those of culture Ϫ qPCR Ϫ samples (0.17 Ϯ 0.15%, P Ͻ 0.05).
The read counts of Shigella/EIEC-specific virulence genes ipaH, ipaH3, ial, ShET2, virA, and virG are shown in Fig. 2. For all six gene targets, culture Ϫ qPCR ϩ samples yielded read counts similar to those of the Shigella culture ϩ qPCR ϩ samples, whereas few reads were generated among Shigella culture Ϫ qPCR Ϫ samples. A high correlation was observed between these genes (e.g., R 2 ϭ 0.889 to 0.932 between ipaH and any of the other 5 genes). The read counts of 18 ipaH-positive samples showed low correlation (e.g., R 2 ϭ 0.0288, 0.006 for ipaH and virA, respectively) with the corresponding ipaH qPCR quantification cycles; however, this improved when normalized to nonhuman sequence reads (e.g., R 2 ϭ 0.259 and 0.361). Unfortunately, the genes responsible for biosynthesis of Shigella O antigen are common to many other bacteria and thus cannot be used to detect or distinguish Shigella from EIEC (24). There is, however, another region, Mxi-Spa-ipa, that has been reported to have some ability to discriminate Shigella from EIEC (23), and these genes showed a similar pattern of positivity among ipaH PCR ϩ specimens and negativity among ipaH PCR Ϫ specimens, supporting the idea that these ipaH PCR ϩ specimens were from Shigella (see Table S4 in the supplemental material).
Testing Kraken with whole-genome sequences of Shigella isolates. In order to validate the precision and recall of Kraken for Shigella at the genus and species level, we tested the 12 DNA sequences from well-characterized isolates of Shigella from Pakistan. At the Shigella genus level, the Kraken precision ranged from 86 to 99.2% (corresponding to the proportion of reads assigned to Shigella out of the reads classified to a genus [15]). At the species level, the precision was either close to 2% or above 80% (Table 1). As shown in Fig. 3, all of the low precision occurred among S. flexneri isolates that were in the S1 phyletic lineage (the S1 lineage includes S. flexneri, S. dysenteriae, and S. boydii as described previously [20]). In contrast, species precision for S. sonnei and S. flexneri of the S2 and S5 lineages, respectively (that contain one nominal species), was Ͼ90%. Accordingly, among the nine culture ϩ qPCR ϩ stool  Table S1 in the supplemental material. NS, not significant. samples, four S. sonnei (S2 lineage) and three S. flexneri (all of S5 lineage) culturepositive specimens were correctly identified by metagenomics sequencing as those with the highest read count, whereas two S. flexneri samples (1 of S1 lineage) were misclassified as other Shigella species (see Table S5 in the supplemental material).
Detection of other pathogens through metagenomics sequencing. In addition, we examined whether the detection of DNA viruses and certain other bacterial pathogens by qPCR (see Table S3 in the supplemental material) could also be supported by the metagenomic results. For many pathogens, such as Campylobacter jejuni and C. coli, adenovirus 40/41, enteroaggregative E. coli (EAEC), and typical enteropathogenic E. coli (EPEC), the pathogen-specific read counts identified by Kraken demonstrated a quantitative relationship with qPCR quantification cycle (C q ) values of the relevant target genes (see Fig. S1 in the supplemental material). However, there was also substantial metagenomic background in qPCR-negative specimens.

DISCUSSION
The main finding of this work is that the metagenomic composition of Shigella qPCR-positive samples is similar to that of culture-positive samples. In other words, the Shigella qPCR results based on ipaH appear to be robust even if cultures are negative. Our result is consistent with those of Lindsay et al. (10) that showed that the culture Ϫ qPCR ϩ samples were indistinguishable from culture ϩ qPCR ϩ samples by clinical criteria.
In the metagenomic analysis, we examined the read counts of known Shigella/EIECspecific virulence genes. We surveyed ipaH, ial, virA, virG, ShET2, and ipaH3. ipaH is the most commonly used target gene for diagnostic PCR assays and is present in multiple copies on both chromosome and plasmid (13). ial, virA, virG, and ShET2 are found in single copy on the plasmid (25)(26)(27), whereas ipaH3, which is a conserved region previously identified to distinguish Shigella from the majority of E. coli, is present in single copy on the chromosome (20). The metagenomic read counts of these genes were highly correlated with each other within a sample, which is consistent with our previous results (9, 11). The read counts distinguished the Shigella qPCR positives from a Phyletic lineage was assigned according to the system of Sahl et al. (20). b Sensitivity refers to the proportion of sequences assigned to the correct Shigella genus or species. Precision, i.e., the positive predictive value, refers to the proportion of correct classifications, out of the total number of classifications attempted.
negative and were similar between culture ϩ qPCR ϩ and culture Ϫ qPCR ϩ samples. As expected, ipaH had the highest read counts for all the Shigella qPCR-positive samples, usually 5-to 7-fold higher than ial, virA, virG, and ShET2. In contrast, ipaH3 had the lowest read counts, usually 20-to 30-fold lower than that of ipaH. This confirmed that ipaH is likely the most efficient available target for PCR detection of Shigella species. As a benchmark comparison, we tested 12 genome sequences from Shigella isolates carefully identified to the species level to determine the precision and recall for Kraken, Clark, MetaPhlan2, and Kaiju. Kraken outperformed for both precision and recall at the genus level. We also tested the assignment of reads to the species level. The Shigella genus includes four nominal species-S. sonnei, S. flexneri, S. dysenteriae, and S. boydii-that can be accurately distinguished by serotyping. The precision for the genus was very good (>86%) and, although the precision for the species could also be very good (>88%), sometimes it was extremely poor (ϳ2.4%). The precision of the sequences was dependent upon the position of the sequence in the phylogenetic tree (Fig. 3). The previously defined phyletic lineage S1 (20) was the source of all of the assignments of sequences that had low precision. This lineage, S1, is characterized by the presence of three nominal species: S. flexneri, S. dysenteriae, and S. boydii. Thus, in this case (and in the S3 lineage, where S. dysenteriae and S. boydii co-occur) (20, 28), all programs, including Kraken, that assume a monophyletic lineage for its assignment (29) produce low precision. However, for species such as S. sonnei that are monophyletic, the assignment to species is very good.
Another observation was that Shigella qPCR-positive samples yielded a much higher proportion of human sequence reads. We suspect this is because clinical shigellosis is invasive and characterized by the presence of blood, mucus, and epithelial disruption. Shigella was most likely the causative agent in these samples because of its strong association with diarrhea at high quantity, even if in the presence of other pathogens (7,10,11).
This study had limitations, such as the small sample size (limited by cost), the depth of metagenomic sequencing, the challenges posed by variable levels (up to 97.9%) of human DNA in fecal samples, low Shigella abundance (0.27 Ϯ 0.32%), the high level of sequence similarity of Shigella genome with E. coli genome in available databases, and imperfect S. flexneri species assignment by Kraken. Nonetheless, our study clearly demonstrates Shigella ipaH qPCR-positive specimens contain genes from Shigella, whereas qPCR-negative specimens do not. Thus, culture misses true cases of Shigella and molecular diagnosis with appropriate quantitative cutoffs provides a more accurate method for detecting Shigella.