Designing HIV Testing Algorithms Based on 2015 WHO Guidelines Using Data from Six Sites in Sub-Saharan Africa

ABSTRACT Our objective was to evaluate the performance of HIV testing algorithms based on WHO recommendations, using data from specimens collected at six HIV testing and counseling sites in sub-Saharan Africa (Conakry, Guinea; Kitgum and Arua, Uganda; Homa Bay, Kenya; Douala, Cameroon; Baraka, Democratic Republic of Congo). A total of 2,780 samples, including 1,306 HIV-positive samples, were included in the analysis. HIV testing algorithms were designed using Determine as a first test. Second and third rapid diagnostic tests (RDTs) were selected based on site-specific performance, adhering where possible to the WHO-recommended minimum requirements of ≥99% sensitivity and specificity. The threshold for specificity was reduced to 98% or 96% if necessary. We also simulated algorithms consisting of one RDT followed by a simple confirmatory assay. The positive predictive values (PPV) of the simulated algorithms ranged from 75.8% to 100% using strategies recommended for high-prevalence settings, 98.7% to 100% using strategies recommended for low-prevalence settings, and 98.1% to 100% using a rapid test followed by a simple confirmatory assay. Although we were able to design algorithms that met the recommended PPV of ≥99% in five of six sites using the applicable high-prevalence strategy, options were often very limited due to suboptimal performance of individual RDTs and to shared falsely reactive results. These results underscore the impact of the sequence of HIV tests and of shared false-reactivity data on algorithm performance. Where it is not possible to identify tests that meet WHO-recommended specifications, the low-prevalence strategy may be more suitable.

HIV rapid diagnostic tests (RDTs) are the main diagnostic tools for HIV screening and diagnosis in resource-constrained settings (1). Given the potential for the severe medical, psychological, and social impacts of HIV misdiagnosis and the evidence of elevated false-positive results from some settings, it is imperative that HIV diagnosis is confirmed to be both sensitive and specific (2).
In 2012 and 2015, the World Health Organization (WHO) published revisions of the HIV testing guidelines with different recommendations for low (<5%)- and high (≥5%)-HIV-prevalence settings (1,3,4). These recommendations call for the sequential use of up to three different serological assays, including RDTs, for final HIV diagnoses. Whereas a first nonreactive test result is sufficient to provide a final negative result in both settings, two and three reactive assays are needed to provide final HIV-positive results in high- and low-prevalence settings, respectively (Fig. 1). The guidelines stipulate that each of the three RDTs should have a sensitivity of at least 99%, while the first RDT should have at least 98% specificity and the second and third RDTs at least 99% specificity; overall, the combination should be designed to minimize the potential for shared false reactivity. Different strategies for high- and low-prevalence settings were developed based on mathematical models using three theoretical assays assumed to meet the criteria described above to achieve an overall positive predictive value (PPV) of at least 99% (1). To date, however, these recommendations and the performance of the resulting algorithms have not been validated using real data from different field contexts.
Several factors could influence the design and performance of these algorithms. Although WHO-prequalified HIV RDTs met the minimum recommended sensitivity and specificity criteria in the prequalification evaluations, several reports from different countries indicate much poorer performance in real-world settings (5–12). Moreover, little is known about shared false-reactivity results among different RDTs (13). The use of the same antigen (Ag) preparations to produce different tests, which is occurring with increasing frequency due to rebranding or relabeling arrangements among test manufacturers (1), can lead to shared cross-reactivity, though this may not be the only cause. Even low levels of shared cross-reactivity, or marginally substandard performance of one RDT, could have a meaningful impact on the performance of an algorithm.
Given concerns about false positivity raised by previous findings, over the period of 2011 to 2015 we conducted an evaluation of eight HIV RDTs and two simple confirmatory assays differentiating antibodies against several viral proteins (14). We used specimens collected at six HIV testing and counseling (HTC) centers in sub-Saharan Africa, the region most highly affected by HIV/AIDS, with approximately 70% of the total number of people living with HIV worldwide (15). Consistent with the aforementioned reports (5–12), this study revealed lower-than-expected specificity for most of the tests and important variations by specimen origin (14). Here, we have used these data to validate the performance of simulated algorithms developed according to the latest WHO recommendations. Additionally, we explored the possibility of using algorithms incorporating simple confirmatory assays that could be suitable for use in low- and middle-income countries.

RESULTS
From August 2011 to January 2015, a total of 2,785 samples collected at the six HTC sites (comprising between 437 and 500 samples at each site) were sent to the reference laboratory. The HIV positivity rate by site ranged from 8.0% to 37.1% (Table 1). More information on the characteristics of clients included in the study is provided elsewhere (16). Using the reference algorithm, 1,306 samples were classified as coming from HIV-positive clients (including 1 positive for HIV-2) and 1,474 as coming from HIV-negative clients. Three samples with inconclusive reference results and two samples with reference results suggestive of acute infection were excluded from the analysis.
The performance of the HIV RDTs and simple confirmatory tests assessed individually and by origin of specimens is described elsewhere (14). Of a total of 438 specimens that gave at least one false reactive result, the majority gave a falsely reactive result with only one of the eight RDTs (n = 295), 81 with two RDTs, 41 with three RDTs, 15 with four RDTs, 4 with five RDTs, and 2 with six RDTs. All RDTs exhibited some shared false-reactivity results with each of the seven other RDTs, with the exception of SD Bioline and Stat-Pak (Table 2).
For only one site, Conakry (Guinea), could we identify at least two RDTs to be used as a second or third test with sensitivity and specificity estimates of ≥99%, as recommended by WHO. Using the testing strategy for high-prevalence settings with Determine as the first test and these assays as second and third tests, the PPV of the algorithms ranged from 98.3% to 100% (Table 3). For three other sites (Douala, Cameroon; Kitgum, Uganda; Homa Bay, Kenya), only one test met the WHO criteria, necessitating the use of tests with a specificity of ≥98% as RDT2 and RDT3 and resulting in PPVs ranging from 92.7% to 100%. For the remaining two sites (Arua, Uganda; Baraka, Democratic Republic of Congo [DRC]), one test met the WHO criteria, but all others had specificities of <98%, necessitating the use of tests with specificities between 96% and 98%. The PPV of the resulting algorithms ranged from 75.8% to 99.6%. Detailed results are presented in Table 3.
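The tabulation of shared false reactivity summarized in Table 2 can be illustrated with a minimal Python sketch. This is not the authors' code (the analysis was done in STATA); the function name `shared_false_reactive`, the data layout, and the toy specimens are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def shared_false_reactive(results, negative_ids):
    """Tabulate false reactivity among reference-negative specimens.

    results: dict mapping specimen id -> {test name: reactive (bool)}
    negative_ids: ids classified HIV negative by the reference algorithm
    Returns (pair_counts, tests_per_specimen):
      pair_counts[(test_a, test_b)] = specimens falsely reactive on both
      tests_per_specimen[k] = specimens falsely reactive on exactly k tests
    """
    pair_counts = Counter()
    tests_per_specimen = Counter()
    for sid in negative_ids:
        reactive = sorted(t for t, r in results[sid].items() if r)
        if reactive:
            tests_per_specimen[len(reactive)] += 1
        for pair in combinations(reactive, 2):
            pair_counts[pair] += 1
    return pair_counts, tests_per_specimen


# Invented toy data: four reference-negative specimens, three tests.
toy = {
    "s1": {"Determine": True, "Vikia": True, "Stat-Pak": False},
    "s2": {"Determine": True, "Vikia": False, "Stat-Pak": False},
    "s3": {"Determine": False, "Vikia": False, "Stat-Pak": False},
    "s4": {"Determine": True, "Vikia": True, "Stat-Pak": True},
}
pairs, per_specimen = shared_false_reactive(toy, ["s1", "s2", "s3", "s4"])
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

A pair with a nonzero count is a candidate source of correlated false positives if both tests are placed in the same algorithm.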
Using the WHO strategy for low-prevalence settings, most simulated algorithms showed PPVs of ≥99%, even for the two sites (Arua, Uganda; Baraka, DRC) where tests with specificities between 96% and 98% were included in the algorithms (Table 4). The proportion of inconclusive results remained low at <1% for most algorithms but rose to 2.5% at sites where tests with specificities between 96% and 98% were included in the algorithms. We also evaluated a simplified version of a reference algorithm, using a rapid test meeting criteria for RDT1 as a screening assay followed by a simple confirmatory assay. The PPVs of these algorithms ranged from 98.1% to 100%, with the proportions of inconclusive results ranging from 0% to 0.5% (Table 5).

DISCUSSION
WHO-recommended HIV testing strategies were developed based on models using theoretical RDTs with high sensitivity and specificity and no shared cross-reactivity. Here, we have used the results of a large multicenter evaluation of individual RDTs to estimate the performance of HIV testing algorithms using real data from six sub-Saharan African HTC sites. To our knowledge, this was the first study that evaluated the performance of algorithms based on the new WHO recommendations; all other such studies published to date focused on strategies using either two tests or three tests, with the third test used as a tiebreaker (7,9,11,17–20). Though WHO has never recommended the use of a tiebreaker due to the associated risk of generating false-positive results, this strategy is still widely used (21). The use of several algorithms simulated here based on the strategy for high-prevalence settings resulted in a PPV of <99%, even though RDTs with high specificity were used as second and third tests, due to shared falsely reactive results among the tests used. In particular, a general trend of shared falsely reactive results between Determine and Vikia could explain the finding that combinations using these two tests with samples from Conakry resulted in a suboptimal PPV of 98.3%, despite the fact that each test used at that site had an estimated specificity of ≥99%. Although we could not identify a similar trend of shared falsely reactive results between Determine and SD Bioline, the level of shared false reactivity was high with samples from Kitgum, leading to a PPV of only 92.7% for algorithms using these tests for Kitgum despite the acceptable specificity of SD Bioline (98.6%) on specimens from this site. A larger sample size is needed to investigate whether this represents a local phenomenon or a random occurrence.
In the absence of reliable knowledge on the source of antigen preparations and of a good understanding of the mechanisms underlying falsely reactive results, only raw data from RDT evaluation studies using samples from local sites can provide the necessary information to avoid shared falsely reactive results.
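The effect of shared false reactivity on a two-test positive result can be illustrated with a short Python sketch. It compares the PPV expected when the false-positive results of two tests are independent with the PPV when they overlap. The function `serial_ppv`, the assumption of independent sensitivities, and all numbers in the example are hypothetical and are not taken from the study data.

```python
def serial_ppv(prevalence, se1, sp1, se2, sp2, shared_fp=None):
    """PPV when a positive result requires both tests to be reactive.

    shared_fp: probability that an HIV-negative specimen is reactive on
    BOTH tests. If None, false positives are assumed independent, i.e.
    (1 - sp1) * (1 - sp2). Sensitivities are assumed independent.
    """
    fp1, fp2 = 1.0 - sp1, 1.0 - sp2
    if shared_fp is None:
        shared_fp = fp1 * fp2
    elif shared_fp > min(fp1, fp2) + 1e-12:
        raise ValueError("shared_fp cannot exceed either test's false-positive rate")
    true_pos = prevalence * se1 * se2
    false_pos = (1.0 - prevalence) * shared_fp
    return true_pos / (true_pos + false_pos)


# Hypothetical setting: 10% prevalence, first test 98% specific,
# second test 99% specific, both 99% sensitive.
independent = serial_ppv(0.10, 0.99, 0.98, 0.99, 0.99)
# Same tests, but half of the second test's false positives are also
# falsely reactive on the first test (0.5% of all negatives).
overlapping = serial_ppv(0.10, 0.99, 0.98, 0.99, 0.99, shared_fp=0.005)
print(f"independent: {independent:.4f}  overlapping: {overlapping:.4f}")
```

Under these invented inputs the independent case clears the 99% PPV target while the overlapping case does not, which mirrors why individual specificity estimates alone cannot predict algorithm performance.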
For sites where only one test had a specificity of >99% and tests with specificities of 96% to 98% had to be included in the algorithms, the PPV of algorithms using the strategy for high-prevalence settings varied widely depending on the order of the second and third tests. In both sites (Arua, Uganda; Baraka, DRC), only algorithms using the highly specific test Stat-Pak as the second test reached or approached the threshold, while all other combinations gave PPVs below 95%. These results underscore the importance of the order of the RDTs in the algorithm and of using the test with the highest specificity as the second (and not third) test in employing a three-test strategy in the absence of two highly specific tests. The strategy recommended for low-prevalence settings, which requires three reactive RDTs to establish a diagnosis of HIV infection, generally resulted in algorithms with very high PPVs. For Baraka, DRC, where none of the high-prevalence algorithms achieved a PPV of ≥99%, this was the only strategy that reached the threshold. In addition, since this strategy classifies discordant results (e.g., RDT1-positive and RDT2-negative [RDT1+ RDT2−] results) as negative results, it is important to ensure that the negative predictive value (NPV), together with the PPV, is >99%, as was the case for the algorithms simulated here. This suggests that the low-prevalence HIV testing strategy may be suitable for use not only in settings with low HIV prevalence but wherever HIV RDTs are known to have specificity issues.
We also propose a testing strategy that, similarly to a reference algorithm, relies on a sensitive screening assay followed by a simple confirmatory assay. One of these confirmatory assays, the ImmunoComb Combfirm, has shown good correlation with Western blotting in evaluations in the DRC and Ethiopia to confirm a two-RDT algorithm positive result but is no longer produced (11,22). Another option, the Geenius assay, has generally shown performance results sufficient for recommending it as an alternative to existing confirmatory assays such as Western blotting or immunoblotting (23–29). However, here we found that the use of these confirmatory assays did not consistently ensure PPVs of ≥99% in the different combinations tested, particularly for the two sites where RDTs showed high levels of false reactivity. Given the added complexity and cost of the Geenius confirmatory assay, we conclude that it does not compare favorably with the three-RDT combination recommended by WHO for use in these settings.
One of the limitations of this study was that Determine was used as the first assay in all algorithms that we simulated. We used Determine for the same reasons for which it is currently used as the first test in most algorithms: its relatively low cost and very high sensitivity. Another limitation is that our sampling strategy underrepresented clients with negative results according to the onsite algorithm, resulting in a collection of specimens that was not representative of the population screened. To account for this verification bias, we conducted a weighted analysis aimed at mitigating its effect. The inclusion of all specimens with inconclusive results from onsite testing might also explain the high proportion of falsely reactive specimens in this study compared to other evaluations, including those for WHO prequalification. We believe, however, that these data reflect the reality of HIV testing at HTC sites. In addition, although centralized testing in a reference laboratory had advantages for standardization and comparison of results, it had the disadvantage of not reproducing all aspects of field conditions. In particular, we could not reproduce repeat testing for clients with inconclusive results, which might have an impact on the final performance of these algorithms. Finally, we did not illustrate the use of these algorithms in low-prevalence settings, since all specimens came from sites that would be classified as high-prevalence sites. A simple calculation using the sensitivity and specificity reported here, together with the prevalence in the setting of interest, could provide useful information on the expected PPV for such settings. In addition, since most of the low-prevalence algorithms achieved a PPV of 100%, which would not be affected by HIV prevalence, our data support the use of the recommended strategy for these settings.
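The simple calculation mentioned above follows from Bayes' rule and can be sketched in Python as follows. The functions `expected_ppv` and `expected_npv` are illustrative names, and the sensitivity, specificity, and prevalence values in the example are hypothetical rather than estimates from this study.

```python
def expected_ppv(sensitivity, specificity, prevalence):
    """Expected PPV of an algorithm at a given prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        return float("nan")  # no positive results expected at all
    return true_pos / (true_pos + false_pos)


def expected_npv(sensitivity, specificity, prevalence):
    """Expected NPV of an algorithm at a given prevalence (Bayes' rule)."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    if true_neg + false_neg == 0:
        return float("nan")  # no negative results expected at all
    return true_neg / (true_neg + false_neg)


# Hypothetical algorithm-level estimates, evaluated across prevalences.
for prevalence in (0.01, 0.05, 0.10, 0.30):
    ppv = expected_ppv(0.995, 0.999, prevalence)
    npv = expected_npv(0.995, 0.999, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.4f}, NPV {npv:.4f}")
```

When the algorithm-level specificity is 100%, the false-positive term vanishes and the expected PPV is 100% at any nonzero prevalence, which is the property invoked above for the low-prevalence algorithms.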
This attempt to illustrate the process and results of designing an HIV testing strategy using real data offers important lessons for navigating the various obstacles in the process. First, our data underscore the impact of shared false-reactivity results on the performance of algorithms and show that this phenomenon affects most RDT combinations to different degrees. More transparent information from test manufacturers on possible shared false reactivity due to test rebranding or common sources of antigens is needed. Moreover, shared falsely reactive results from other studies performed using a standard panel for the evaluation of different assays would provide useful complementary information. Second, our results demonstrate that data from local evaluations are important for assessing diagnostic accuracy in the specific setting, although obtaining such information is often not feasible (30). We also highlight the importance of the order of tests, particularly in using the strategy for settings of high HIV prevalence, where the test with highest specificity should be used as the second rather than the third assay. Finally, if sufficient information is available and these steps are followed, good RDT-based HIV testing algorithms can be designed, though sometimes only with the strategy recommended for low-prevalence settings.

MATERIALS AND METHODS
Minimums of 220 positive and 220 negative specimens, as classified by the algorithm used on site, were prospectively collected as described previously (16). All frozen plasma samples were then sent to the AIDS reference laboratory at the Institute for Tropical Medicine (ITM), Antwerp, Belgium, for characterization with a standard reference algorithm (Fig. 1) and for testing with eight RDTs and two simple confirmatory assays.
Reference method for HIV diagnosis. All plasma samples were tested at ITM using a fourth-generation enzyme-linked immunosorbent assay (ELISA) detecting both antibodies and antigens (Vironostika HIV Uni-Form II Ag/Ab; bioMérieux, France) followed by a line immunoassay (LIA) (i.e., INNO-LIA HIV I/II Score; Innogenetics NV, Ghent, Belgium) and an antigen-enzyme immunoassay (Ag-EIA) (i.e., Innotest HIV Antigen MAb; Innogenetics NV, Ghent, Belgium) and in-house DNA PCR when applicable, as described for Fig. 1 (14). All tests were performed by six trained laboratory technicians. Each test was read by two technicians, each of whom was blind to the results reported by the other reader and to the reference standard result. When the two readers gave discordant results, a third reader was consulted to resolve the discrepancy. The details of the tests, as well as their performance per origin of specimens in our evaluation, are presented elsewhere (14).
Simulated algorithms. Results of the RDTs performed at ITM were used to construct simulated algorithms using the WHO-recommended testing strategies for high (≥5%)-prevalence and low (<5%)-prevalence settings, as described for Fig. 1A and B. We could not simulate the repetition of the tests for discordant RDT1+ RDT2− results or the retest 14 days later, as recommended by WHO. All simulations used the Determine RDT as the first test. For RDT2 and RDT3, we selected all assays that met WHO recommendations, i.e., sensitivity of ≥99% and specificity of ≥99%, based on their individual performance estimates, compared to the reference algorithm, per origin of specimens (14). For sites where fewer than two tests met these criteria, we expanded the criteria to tests that had a specificity estimate of ≥98% or ≥96%. We also ensured that assays RDT2 and RDT3 had higher specificity than RDT1 in all the algorithms simulated here.
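The decision rules of the two strategies can be sketched in Python as follows. This is a paraphrase under stated assumptions, not a reproduction of Fig. 1: the rules for positive and discordant results follow the text above, while classifying an RDT1+ RDT2− RDT3+ result as inconclusive in the high-prevalence strategy and an RDT1+ RDT2+ RDT3− result as inconclusive in the low-prevalence strategy reflects our reading of the WHO 2015 strategies. Repeat testing and the 14-day retest are omitted, as in the simulations. The function names are hypothetical.

```python
def classify_high_prevalence(rdt1, rdt2, rdt3):
    """High-prevalence strategy: two reactive assays give a positive result.

    Arguments are booleans (True = reactive). rdt2 and rdt3 are ignored
    when the sequence would not have reached them.
    """
    if not rdt1:
        return "negative"      # first test nonreactive: final negative
    if rdt2:
        return "positive"      # RDT1+ RDT2+
    # Discordant RDT1+ RDT2-: the third assay is used (assumed rule).
    return "inconclusive" if rdt3 else "negative"


def classify_low_prevalence(rdt1, rdt2, rdt3):
    """Low-prevalence strategy: three reactive assays give a positive result."""
    if not rdt1:
        return "negative"      # first test nonreactive: final negative
    if not rdt2:
        return "negative"      # discordant RDT1+ RDT2- reported negative
    # RDT1+ RDT2+: the third assay decides (RDT3- assumed inconclusive).
    return "positive" if rdt3 else "inconclusive"


# One specimen, reactive on RDT1 and RDT2 but not on RDT3.
print(classify_high_prevalence(True, True, False))
print(classify_low_prevalence(True, True, False))
```

Swapping the assays assigned to `rdt2` and `rdt3` changes which specimens satisfy the RDT1+ RDT2+ rule, which is why the order of the second and third tests affects the PPV of the high-prevalence strategy.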
In addition, we simulated a testing strategy using an RDT as a screening test, followed by a simple confirmatory assay (Fig. 1C). For the screening test, we used all RDTs that met the WHO recommendations for the first assay, i.e., sensitivity of ≥99% and specificity of ≥98%.
Statistical analysis. STATA version 13.1 (StataCorp, College Station, Texas, USA) was used to carry out data analysis.
As for any performance evaluation, the results of the simulated algorithms were compared to those of the reference algorithm, considered the gold standard. We performed an inverse-probability weighted analysis to adjust for the initial sampling strategy, which underrepresented samples classified as negative by the onsite algorithm. For each participant, the weight was calculated as the inverse of the probability of inclusion in the study, i.e., as the total number of clients with similar onsite results during the study period divided by the number of included participants with similar results.
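The weight definition above can be sketched in a few lines of Python. The analysis itself was run in STATA; the function `inclusion_weights`, the stratum labels, and the counts below are invented for illustration and do not come from the study.

```python
def inclusion_weights(screened_counts, included_counts):
    """Inverse-probability weights per onsite-result stratum.

    weight = clients screened with that onsite result
             / participants included with that onsite result
    """
    weights = {}
    for stratum, n_included in included_counts.items():
        if n_included <= 0:
            raise ValueError(f"no included participants in stratum {stratum!r}")
        weights[stratum] = screened_counts[stratum] / n_included
    return weights


# Invented counts for one site over the study period.
screened = {"positive": 300, "negative": 2700, "inconclusive": 20}
included = {"positive": 220, "negative": 220, "inconclusive": 20}
weights = inclusion_weights(screened, included)
for stratum, w in weights.items():
    print(f"{stratum}: weight {w:.2f}")
```

With these invented counts, each included onsite-negative specimen stands in for about 12 screened clients, which restores the weight of negatives that the sampling strategy underrepresented.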
Since all tests included in this evaluation were antibody tests and were not expected to detect acute infections, we excluded samples classified as acute infections by the reference algorithm, i.e., positive with a fourth-generation EIA, negative or indeterminate with LIA, and positive with the antigen test (Fig. 1). We also excluded from all analyses samples with indeterminate results by the reference algorithm. Samples with an inconclusive result by a specific simulated algorithm were excluded from the estimates of sensitivity, specificity, and predictive values of that specific algorithm, and their number and proportion are reported separately.
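A minimal Python sketch of this bookkeeping is given below: inconclusive results are left out of the 2×2 table and reported as a separate proportion, and each specimen contributes its sampling weight. Whether the reported proportion of inconclusive results was itself weighted is not stated here, so weighting it is an assumption of this sketch; the function `weighted_performance` and the toy records are hypothetical.

```python
def weighted_performance(records):
    """Weighted accuracy estimates for one simulated algorithm.

    records: iterable of (reference_positive, algorithm_result, weight),
    with algorithm_result in {"positive", "negative", "inconclusive"}.
    """
    tp = fp = tn = fn = inconclusive = total = 0.0
    for reference_positive, result, weight in records:
        total += weight
        if result == "inconclusive":
            inconclusive += weight   # kept out of the 2x2 table
        elif result == "positive":
            if reference_positive:
                tp += weight
            else:
                fp += weight
        else:
            if reference_positive:
                fn += weight
            else:
                tn += weight

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "inconclusive_proportion": ratio(inconclusive, total),
    }


# Invented records: (reference positive?, algorithm result, weight).
toy_records = [
    (True, "positive", 1.0),
    (True, "positive", 1.0),
    (False, "positive", 1.0),
    (False, "negative", 10.0),
    (False, "negative", 10.0),
    (False, "inconclusive", 1.0),
    (True, "negative", 10.0),
]
summary = weighted_performance(toy_records)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```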
Ethics. The study was approved by the Médecins sans Frontières (MSF) Ethical Review Board and by ethics committees in the five countries where the samples were collected. All participants provided written informed consent.