
Several methods can be used to diagnose *Helicobacter pylori* infection. Most of them require upper gastrointestinal endoscopy for retrieval of a gastric biopsy specimen. For serology, no upper gastrointestinal endoscopy is required, but blood must be obtained to detect *H. pylori* antibodies. *H. pylori* serology is attractive in comparison with other diagnostic methods because it is simple, inexpensive, and less of a burden for the patient. Several kits for the detection of *H. pylori* by serology have become commercially available since the discovery of *H. pylori* by Warren (87) in 1983. Most of these *H. pylori* serology kits are based on various antibody preparations and different techniques.

The introduction of commercially available *H. pylori* kits has led to an increase in the number of studies that have evaluated kit characteristics. Recently, a systematic review comparing the accuracies of commonly used commercial serology kits for the detection of *H. pylori* infection was conducted (48). To account for the different reference standards and designs used by various investigators, that review included only studies that evaluated pairs of serology kits and compared the kits only within those studies. A more appropriate method of comparing different diagnostic tests and the performance of different interpreters of one test is to calculate the area under the receiver operating characteristic curve (AURC) for each test (83). To correct for dependence between AURCs within the same study population, we used a random-effect model. By reviewing the literature, we also tried to determine whether *H. pylori* serology can accurately diagnose *H. pylori* infection. However, in contrast to the study by Loy et al. (48), we reviewed all the studies that evaluated commercially available *H. pylori* serology kits.

## DATA COLLECTION AND EXTRACTION

Identification and eligibility of publications. A computerized and manual literature search was performed in early 1998. Relevant publications were identified in MEDLINE (1983 to 1997) with the medical subject heading terms *Helicobacter* or *pylori*, *Sero*\*, *Sera*\*, *Seru*\*, *Sensitivity*, and *Human* in check tags. Furthermore, additional publications were retrieved by reviewing references in publications found by MEDLINE. The criteria used to select publications were as follows: *H. pylori* infection was established before treatment; the *H. pylori* serology kit was commercially available; the number of patients, prevalence of infection, and the sensitivity and specificity of the *H. pylori* serology kit were described or could be calculated; and the studies were published in Dutch, English, French, or German.

Data analysis. New diagnostic tests are mainly evaluated by determining the sensitivity and specificity of the test. For evaluative purposes, the sensitivity and specificity are less useful (83). On the basis of the receiver operating characteristic (ROC) curve, we calculated the AURC, which is a measure of the diagnostic performance of a test (42). It is independent of cutoff points and reasonably immune to selection bias. The appropriate method for calculating the AURC depends on the type of result that the serology kit produces. For kits with a quantitative test result, we used a method that estimates the AURC from one combination of a true-positive rate and a false-positive rate, on the assumption that the data for the *H. pylori*-infected and noninfected persons were logistically distributed and had equal variances (84). For serology kits with a qualitative test result, we used the trapezium method to calculate the AURC (36). However, comparison of the *H. pylori* serology kits revealed that the diagnostic performance differed substantially depending on how the AURC was calculated. Therefore, we decided to estimate the AURCs by the trapezium method, irrespective of the distribution of the test result. Use of the trapezium method to estimate the AURC of a serology kit with a quantitative test result possibly underestimates its diagnostic performance (36).
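The two ways of obtaining an AURC from a single sensitivity and specificity can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the closed form in `logistic_aurc` is the standard area under a constant-odds-ratio ROC curve, which is what equal-variance logistic distributions imply; that it matches the method of reference 84 exactly is an assumption. The median sensitivity and specificity reported in this review (92% and 83%) are used purely as illustrative inputs.

```python
import math


def trapezium_aurc(sensitivity, specificity):
    """Area under the ROC polyline (0,0) -> (1 - spec, sens) -> (1,1)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")
    fpr = 1.0 - specificity
    tpr = sensitivity
    # Left trapezium spans FPR 0..fpr with heights 0 and tpr.
    left = 0.5 * fpr * (0.0 + tpr)
    # Right trapezium spans FPR fpr..1 with heights tpr and 1.
    right = 0.5 * (1.0 - fpr) * (tpr + 1.0)
    return left + right  # algebraically (sensitivity + specificity) / 2


def logistic_aurc(sensitivity, specificity):
    """AURC assuming equal-variance logistic distributions (assumed form).

    Under that assumption every point on the ROC curve shares one
    diagnostic odds ratio, and the area has a closed form.
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie in (0, 1)")
    odds_ratio = (sensitivity / (1.0 - sensitivity)) * (
        specificity / (1.0 - specificity)
    )
    if math.isclose(odds_ratio, 1.0):
        return 0.5  # uninformative test: the ROC curve is the diagonal
    c = odds_ratio - 1.0
    return odds_ratio * (c - math.log(odds_ratio)) / (c * c)


# Illustrative inputs only: the median values reported in this review.
aurc_trapezium = trapezium_aurc(0.92, 0.83)
aurc_logistic = logistic_aurc(0.92, 0.83)
```

With these inputs the trapezium estimate is the smaller of the two, which is consistent with the remark that the trapezium method may underestimate the performance of a quantitative kit.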

The AURC was used to explore possible differences between clinical features of study populations and methodological aspects of the serology kits. The tests were stratified into the following: report type (abstract, letter, or article), publication year (1991 to 1997), whether the study population was a consecutive series or a selection of a relevant study population, whether or not the patients had dyspeptic symptoms, the nationality of the study population, the reference standard used, the serology kit used, kit scale (quantitative, qualitative), the type of immunoglobulin (immunoglobulin A [IgA], IgG, and IgM simultaneously, IgA alone, or IgG alone) used to detect serum antibodies, the analysis technique of the serology kits (agglutination, enzyme immunoassay [EIA], enzyme-linked immunosorbent assay [ELISA], fixation, or immunochemical analysis), and whether whole blood or serum was used. We could not examine whether the generation of the test influenced performance because few studies mentioned this.

Statistical methods. We first tried to model the heterogeneity between the studies by means of an ordinary least-squares regression equation in which all the clinical features and methodological aspects were simultaneously included. Unfortunately, this was not possible because of convergence problems. A best-subset analysis was not possible for the same reason. Therefore, we decided to perform a separate regression analysis for each clinical feature. It is very likely that the AURCs for different serology kits are correlated when they are used with the same study population. By introducing a random effect for study population, we could model dependency between kits within the same study population (24) (see the Appendix). Moreover, the imprecision of the AURCs varied per study. In order to correct for the heterogeneity in the precision of the AURCs caused by different study sizes, we also performed a weighted regression analysis with weights proportional to 1/SE^{2}, where SE is the standard error (see the Appendix). Whenever the AURC is equal to 1 or 0, the SE is 0 and the weight 1/SE^{2} is undefined. If this occurred, the study was excluded from the analysis.
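The weighting step can be illustrated with a short sketch. It only shows how weights proportional to 1/SE^{2} behave and how a result with SE = 0 is dropped; the function name and the example standard errors are hypothetical, and the actual analysis was run in SAS rather than with this code.

```python
def inverse_variance_weights(standard_errors):
    """Return (kept_indices, weights) with weights proportional to 1/SE^2.

    Results with SE == 0 (AURC of exactly 0 or 1) have no finite weight
    and are excluded, mirroring the exclusion rule described above.
    Weights are normalised to sum to 1 for readability.
    """
    kept = [i for i, se in enumerate(standard_errors) if se > 0.0]
    raw = [1.0 / standard_errors[i] ** 2 for i in kept]
    total = sum(raw)
    return kept, [w / total for w in raw]


# Hypothetical standard errors for three test results; the third has
# a perfect AURC and therefore SE = 0.
example_ses = [0.02, 0.04, 0.0]
kept_indices, weights = inverse_variance_weights(example_ses)
```

Halving the SE quadruples the weight, so in this example the first result carries four times the weight of the second.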

For each regression model an overall *F*(NDF,DDF) test, where NDF is the degree of freedom in the numerator and DDF is the degree of freedom in the denominator of the *F* test, was used to examine whether the hypothesis β_{1} = 0, β_{2} = 0, . . ., β_{k} = 0 (no fixed effect) should be rejected. Akaike’s information criterion (AIC) is given to choose between the ordinary least-squares model, the random-effects model, and the weighted-random-effects model for each feature. The higher the AIC, the better the model fit. We also performed a weighted-random-effect regression analysis, using the significant features from the previous model simultaneously, in order to correct for possible confounding. These analyses were performed with SAS software (70). For multiple comparisons the Bonferroni correction was used to keep the overall α level at 0.05.
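As a small illustration of the Bonferroni correction mentioned above, each pairwise comparison is tested at the overall α divided by the number of comparisons. The function name and the *P* values below are hypothetical and are not results from this review.

```python
def bonferroni_significant(p_values, overall_alpha=0.05):
    """Flag each comparison as significant at the Bonferroni-adjusted level.

    The per-comparison threshold is overall_alpha / number of comparisons,
    which keeps the family-wise error rate at or below overall_alpha.
    """
    if not p_values:
        return []
    threshold = overall_alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical P values for three pairwise kit comparisons.
example_p_values = [0.010, 0.020, 0.0005]
flags = bonferroni_significant(example_p_values)
```

With three comparisons the per-comparison threshold is 0.05/3, roughly 0.0167, so a *P* value of 0.020 that would pass an uncorrected test no longer counts as significant.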

## DATA SYNTHESIS

We found a total of 83 publications (1-23, 25-35, 37-41, 43-47, 49-69, 71-82, 85, 86, 88-91) with the MEDLINE search and by reviewing the references in the articles from the original search. Most publications had to be excluded because the *H. pylori* serology kit used was not commercially available. In those that could be included, a total of 177 tests with 36 different commercially available *H. pylori* serology kits had been performed with 26,812 patients (Table 1). The medians (25 and 75% quantiles) of the sensitivity and specificity for *H. pylori* serology were 92% (85 and 96%) and 83% (73 and 92%), respectively. However, the sensitivities and specificities of the *H. pylori* serology kits varied considerably between tests (Fig. 1).

Results for two tests were excluded from the regression analysis because their perfect diagnostic performance (the AURCs were 1) meant that we could not calculate the SE of the AURC. According to the standard regression models, several clinical features and methodological aspects caused differences in diagnostic performance (Table 2). However, after correcting for the dependence between AURCs within the same study and the heterogeneity in the precision of the AURCs caused by different study sizes, only two aspects remained statistically significantly different. First, a major clinical feature of the study population that led to heterogeneity was the way in which the study population had been selected (Table 3). Second, because the investigated *H. pylori* serology kits were based on various antibody preparations, the diagnostic performances differed substantially. In the final weighted-random-effects regression model in which the features “consecutive yes/no” and “type of antibodies measured” were included, the estimated AURC for a nonconsecutive patient series was 0.053 (*P* = 0.01) higher than that for a consecutive patient series. The estimated AURC for kits that measured “IgA antibodies only” was 0.063 (*P* = 0.01) lower than that for kits that measured “IgG antibodies only,” while for kits that measured “IgA, IgG, and IgM simultaneously” it was 0.22 lower (*P* < 0.001). The *P* values for the overall *F* test in the multivariate weighted-random-effect regression analysis for the categories consecutive patients and antibodies were <0.001 and 0.013, respectively.
The AURC for the serology kits that measured “IgA, IgG, and IgM antibodies” was 0.16 (*P* = 0.001) lower than that for the kits that measured “IgA antibodies alone,” in line with the two differences relative to IgG given above. After correcting for the way that the study population had been selected, an evaluation of only the serology kits that measured IgG antibodies with more than five test kits revealed that the diagnostic performance of the Helico-G serology kit was significantly lower (*P* < 0.001) than that of the Anti-Hp serology kit (Table 4). The overall *F*-test value for the serology kits with NDF equal to 8 and DDF equal to 47 was 4.07 (*P* = 0.001), and the overall *F*-test value for the consecutive category with NDF equal to 1 and DDF equal to 47 was 2.21 (*P* = 0.14).

## IMPLEMENTATION OF DIAGNOSTIC TESTS

Before implementing new diagnostic tests in clinical practice, careful evaluations must be done. Three topics are of importance (83). First, the test must have been evaluated with the indicated study population, i.e., the population of patients suspected of having the disease in question. The relevant population in this case consisted of a consecutive series of patients with dyspeptic complaints referred for upper gastrointestinal endoscopy. However, most studies analyzed a highly selected sample of patients with dyspepsia referred for upper gastrointestinal endoscopy. For a highly selected sample, *H. pylori* serology has excellent diagnostic performance. If a population of consecutive patients is tested, the diagnostic performance decreases.

The second topic of importance is determination of the diagnostic performance. For diagnostic tests, sensitivity and specificity are the most commonly used measures of test performance. Sensitivity and specificity are important parameters for diagnostic purposes but not for evaluative purposes. First, the use of different cutoff points for test positivity leads to various sensitivities and specificities. Second, the distribution of the test results for *H. pylori*-positive and -negative patients can vary considerably among studies because of selection. To overcome these problems, the presentation of the entire range of sensitivities and specificities at various cutoff points by a ROC curve results in better comparability of diagnostic tests. However, ROC curves and/or test result distributions were rarely presented in the publications. Fortunately, it is possible to make a fairly accurate estimation of the underlying ROC curve if one sensitivity and one specificity are mentioned (42). We think that this method is more appropriate for evaluation of the diagnostic performance of tests than the summary ROC technique used by Loy et al. (48). First, by analyzing the AURCs, the different cutoff points used for the same kit with different study populations were of no importance. Second, to compare pairs of serology kits, Loy et al. (48) needed studies that had used two kits, whereas by our method we could compare kits without any such restriction. A statistical test could be used to evaluate whether one kit was better than another. If the hypothesis of “no kit effect” is rejected, a multiple-comparison method can be used to compare pairs of tests. By introducing a random effect for study population as an alternative for the paired *t* test used by Loy et al. (48), we allowed for dependence between kits in the same study. Third, the weighted-regression method took the various study population sizes into account and corrected for them.
Small studies with low AURCs will have smaller weights than large studies with high AURCs. The summary ROC analysis used equal weights for the studies involved, while it ignored the different study sizes and AURCs. Fourth, it seems to us that our method is likely to be more efficient for the testing of kits and other covariables because it incorporated studies and all kits in one analysis.

One problem remains: we do not know whether the random-effect model assumptions are correct. It is not clear whether the correlation between two kits and the correlation between two other kits within the same study population are equal. Maybe different correlations between different pairs of kits are better descriptions of reality. Unfortunately, most kits were used in only a few studies, and the number of kits used within one study was too small to thoroughly test the model assumptions. We used the MIXED procedure from the SAS software for the analysis. This procedure did not converge when all clinical features were included in the models because of too few observations. An automatic subsets analysis is not part of this SAS procedure. Therefore, we had to perform a univariate analysis. We could only perform a multivariate weighted-regression analysis with correction for random effects, in which the statistically significant clinical features from the univariate analysis were simultaneously evaluated in order to correct for possible confounding. A necessary condition for confounding is that the confounding variable is related to the feature under study and to the outcome (AURC). If a clinical feature or methodological aspect is not significant in a univariate model, then it is very unlikely to be significant in a model in which more features are included. However, the differences found by our method were confirmed in the analysis of kits and studies that fulfilled the requirements of Loy et al. (48). In agreement with Loy et al. (48), the Anti-Hp serology kit performed better than the Helico-G serology kit. Furthermore, the Malakit serology kit also displayed a higher although not statistically significant AURC than the Helico-G serology kit.

Finally, the relation between a new test and current diagnostic tests needs to be established. Many methods for the diagnosis of *H. pylori* infection are available. Because there is no consensus about a reference standard, several methods were used to identify *H. pylori* infection. The definition of the reference standard used in the publications ranged from only one diagnostic method (histology, culture, or rapid urease testing) having to be positive to more methods having to be positive (culture, rapid urease testing, and the urea breath test). The selection of a test as a reference reflects the personal preference of the investigator, which might lead to bias. Furthermore, the sensitivity and specificity of biopsy specimen-based methods vary and are frequently about 90%. Therefore, it is inappropriate to use other imperfect diagnostic tests as reference methods to measure diagnostic performance. However, the diagnostic performance of *H. pylori* serology was not influenced by any of the 15 different reference standards used.

## RECOMMENDATION

In contrast to the conclusion drawn by Loy et al. (48), the diagnostic performances of various serology kits differed substantially because commercially available serology kits were based on various antibody preparations and were used with different study populations. Our results showed that serology kits that measured IgA, IgG, and IgM simultaneously (Pyloriset latex, CFT *H. pylori*) or IgA alone (Pyloriset, GAP) for the detection of *H. pylori* antibodies in serum did not perform as well as those that measured only IgG antibodies. The overall performance of commercially available serology kits that measure IgG antibodies for the diagnosis of *H. pylori* infection showed that serology is an accurate means of diagnosing *H. pylori* infection in patients. Owing to the small differences in diagnostic performance between serology kits that measure IgG antibodies, other aspects, such as the price, ease of handling, or number of equivocal results, are becoming increasingly important when choosing a serology kit.

## Appendix

The regression equation used to model the heterogeneity between the studies by means of an ordinary least-squares regression analysis was as follows:

$$\mathrm{AURC} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

where $\beta_0$ is the intercept; $\beta_j$, $j = 1, \ldots, k$, is the regression coefficient; and $X_j$, $j = 1, \ldots, k$, is the dummy variable indicating the category of the clinical feature. The term $\varepsilon$ is the residual, which is normally distributed with variance $\sigma_{\varepsilon}^{2}$.

It is very likely that the AURCs for different serology kits are correlated when they are used within the same study population. By introducing a random effect for study population we could model dependency between kits within the same study population (24). The regression equation for the random-effects model was as follows:

$$\mathrm{AURC} = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + b_1 S_1 + \cdots + b_m S_m + \varepsilon$$

where $X_j$, $j = 1, \ldots, k$, and $\varepsilon$ are as described above; $\beta_j$, $j = 1, \ldots, k$, is now called the fixed effect; and $b_l$, $l = 1, \ldots, m$, is the random effect, which is independent and normally distributed with common variance $\sigma_{b}^{2}$. $S_l$, $l = 1, \ldots, m$, is the dummy variable indicating the study population.

In order to correct for the heterogeneity in the precision of the AURCs caused by different study sizes, we also performed a weighted-regression analysis with weights proportional to $1/\mathrm{SE}^{2}$. The SE was computed according to the expression given by Hanley and McNeil (36):

$$\mathrm{SE}(\mathrm{AURC}) = \sqrt{\frac{\mathrm{AURC}\,(1 - \mathrm{AURC}) + (n_A - 1)(Q_1 - \mathrm{AURC}^{2}) + (n_N - 1)(Q_2 - \mathrm{AURC}^{2})}{n_A\, n_N}}$$

where $n_A$ is the number of abnormal individuals (*H. pylori* infected), and $n_N$ is the number of healthy individuals (*H. pylori* noninfected). The expression for $Q_1$ is given by $Q_1 = \mathrm{AURC}/(2 - \mathrm{AURC})$, and that for $Q_2$ is given by $Q_2 = (2 \times \mathrm{AURC}^{2})/(1 + \mathrm{AURC})$. This formula was derived under the assumption that the ratings are on a scale that is sufficiently continuous not to produce “ties”. Although in our case we used dichotomous tests, we believed that the formula for SE would nevertheless be useful. We used the SE only for the weighting procedure and expected to obtain the same answer when the unknown “true” SE was proportional to the SE that we used.
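The SE expression of Hanley and McNeil (36), with $Q_1$ and $Q_2$ as defined above, can be written as a short function. This is an illustrative sketch rather than the code used for the analysis; the function name and the example group sizes (60 infected, 40 noninfected) are hypothetical.

```python
import math


def hanley_mcneil_se(aurc, n_abnormal, n_normal):
    """Standard error of an AURC according to Hanley and McNeil.

    aurc       -- area under the ROC curve, in [0, 1]
    n_abnormal -- number of H. pylori-infected individuals (n_A)
    n_normal   -- number of noninfected individuals (n_N)
    """
    if not 0.0 <= aurc <= 1.0:
        raise ValueError("aurc must lie in [0, 1]")
    if n_abnormal < 1 or n_normal < 1:
        raise ValueError("both groups need at least one individual")
    q1 = aurc / (2.0 - aurc)
    q2 = 2.0 * aurc ** 2 / (1.0 + aurc)
    variance = (
        aurc * (1.0 - aurc)
        + (n_abnormal - 1) * (q1 - aurc ** 2)
        + (n_normal - 1) * (q2 - aurc ** 2)
    ) / (n_abnormal * n_normal)
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Hypothetical study: AURC of 0.875 with 60 infected and 40 noninfected.
example_se = hanley_mcneil_se(0.875, 60, 40)
# A perfect AURC gives SE = 0, so no finite weight 1/SE^2 exists.
perfect_se = hanley_mcneil_se(1.0, 60, 40)
```

The second call shows why tests with an AURC of exactly 1 (or 0) had to be left out of the weighted analysis: every term in the numerator vanishes.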

## ACKNOWLEDGMENTS

We thank J. L. Severens and E. H. van de Lisdonk for comments that improved an earlier version of this paper.

- Copyright © 1998 American Society for Microbiology