An independent, multi-country head-to-head accuracy comparison of automated chest x-ray algorithms for the triage of pulmonary tuberculosis

medRxiv [Preprint]. 2024 Jun 19:2024.06.19.24309061. doi: 10.1101/2024.06.19.24309061.

Abstract

Background: Computer-aided detection (CAD) algorithms for automated chest X-ray (CXR) reading have been endorsed by the World Health Organization for tuberculosis (TB) triage, but independent, multi-country assessment and comparison of current products are needed to guide implementation.

Methods: We conducted a head-to-head evaluation of five CAD algorithms for TB triage across seven countries. We included CXRs from adults who presented to outpatient facilities with at least two weeks of cough in India, Madagascar, the Philippines, South Africa, Tanzania, Uganda, and Vietnam. Participants completed a standard evaluation for pulmonary TB, including sputum collection for Xpert MTB/RIF Ultra and culture. Against a microbiological reference standard, we calculated and compared accuracy overall, by country, and by key subgroups for five CAD algorithms: CAD4TB (Delft Imaging), INSIGHT CXR (Lunit), DrAid (Vinbrain), Genki (Deeptek), and qXR (qure.AI). We determined the area under the receiver operating characteristic (ROC) curve (AUC) and whether any CAD product could achieve the minimum target accuracy for a TB triage test (≥90% sensitivity and ≥70% specificity). We then applied country- and population-specific thresholds and recalculated accuracy to assess any improvement in performance.
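The two accuracy measures named in the Methods can be illustrated with a minimal sketch. It assumes each CXR has a continuous CAD score and a binary microbiological reference label; the function names, the "score >= threshold is positive" convention, and the toy data are hypothetical and are not taken from the study or from any CAD product.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def specificity_at_sensitivity(
    scores: Sequence[float], labels: Sequence[int], target_sens: float = 0.90
) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity reaches target_sens. Assumption: score >= threshold
    means the CAD read is positive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Walk thresholds from strictest to most lenient; the first one that
    # reaches the target sensitivity also gives the best specificity.
    for t in sorted(set(scores), reverse=True):
        sens = sum(s >= t for s in pos) / len(pos)
        if sens >= target_sens:
            spec = sum(s < t for s in neg) / len(neg)
            return t, sens, spec
    raise ValueError("target sensitivity not reachable")


def meets_triage_target(sens: float, spec: float) -> bool:
    """Minimum target stated in the abstract: >=90% sensitivity, >=70% specificity."""
    return sens >= 0.90 and spec >= 0.70


# Hypothetical toy data (not study data): 5 reference-positive, 5 reference-negative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

toy_auc = auc(toy_scores, toy_labels)
toy_thr, toy_sens, toy_spec = specificity_at_sensitivity(toy_scores, toy_labels)
print(toy_auc, toy_thr, toy_sens, toy_spec, meets_triage_target(toy_sens, toy_spec))
```

Fixing sensitivity first and then reading off specificity mirrors how the abstract reports results ("specificity at 90% sensitivity"); the confidence intervals reported in the study are not reproduced here.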

Results: Of 3,927 individuals included, the median age was 41 years (IQR 29-54), 12.9% were people living with HIV (PLWH), 8.2% were living with diabetes, and 21.2% had a prior history of TB. The overall AUC ranged from 0.774 to 0.819, and specificity at 90% sensitivity ranged from 64.8% to 73.8%. CAD4TB had the highest overall accuracy (specificity 73.8%, 95% CI 72.2-75.4, at 90% sensitivity), although qXR and INSIGHT CXR also achieved the target 70% specificity. Accuracy was heterogeneous across countries; sensitivity was lower among females and PLWH, while specificity was lower among males and people with a history of TB. Performance remained stable regardless of diabetes status. When country- and population-specific thresholds were applied, at least one CAD product could achieve or approach the target accuracy for each country and subgroup, except for PLWH and those with a history of TB.
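The effect of country- or population-specific thresholds can be sketched as follows. This is an illustrative assumption about how such thresholds might be derived (one threshold per group, chosen to reach 90% sensitivity within that group), not the authors' documented procedure; the group names "A" and "B" and all scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, int]  # (group, CAD score, reference label 0/1)


def threshold_for_sensitivity(pos_scores: List[float], target_sens: float) -> float:
    """Highest threshold at which at least target_sens of positives score >= it."""
    for t in sorted(set(pos_scores), reverse=True):
        if sum(s >= t for s in pos_scores) / len(pos_scores) >= target_sens:
            return t
    raise ValueError("no reference-positive scores")


def group_specific_accuracy(
    records: Iterable[Record], target_sens: float = 0.90
) -> Dict[str, Tuple[float, float, float]]:
    """Return {group: (threshold, sensitivity, specificity)} with one
    threshold chosen separately inside each group."""
    by_group = defaultdict(lambda: ([], []))  # index 0: negatives, 1: positives
    for group, score, label in records:
        by_group[group][label].append(score)
    result = {}
    for group, (neg, pos) in by_group.items():
        if not neg or not pos:
            raise ValueError(f"group {group!r} lacks positives or negatives")
        t = threshold_for_sensitivity(pos, target_sens)
        sens = sum(s >= t for s in pos) / len(pos)
        spec = sum(s < t for s in neg) / len(neg)
        result[group] = (t, sens, spec)
    return result


# Hypothetical toy records: group "A" scores run higher than group "B".
toy_records = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.7, 0), ("A", 0.2, 0),
    ("B", 0.6, 1), ("B", 0.4, 1), ("B", 0.5, 0), ("B", 0.1, 0),
]

per_group = group_specific_accuracy(toy_records)
# Pooled comparison: ignore group membership and pick a single threshold.
pooled = group_specific_accuracy([("all", s, y) for _, s, y in toy_records])
print(per_group)
print(pooled)
```

In this toy example the single pooled threshold (0.4) misclassifies one of the two negatives in group "A", whereas the group-specific threshold (0.8) does not, which is the kind of specificity gain the Results describe.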

Conclusions: Multiple CAD algorithms can achieve or exceed the minimum target accuracy for a TB triage test, with improvement when using setting- or population-specific thresholds. Further efforts are needed to integrate CAD into routine TB case detection programs in high-burden communities.

Publication types

  • Preprint