Use of electronic health record data and machine learning to identify candidates for HIV pre-exposure prophylaxis: a modelling study

Lancet HIV. 2019 Oct;6(10):e688-e695. doi: 10.1016/S2352-3018(19)30137-7. Epub 2019 Jul 5.

Abstract

Background: The limitations of existing HIV risk prediction tools are a barrier to implementation of pre-exposure prophylaxis (PrEP). We developed and validated an HIV prediction model to identify potential PrEP candidates in a large health-care system.

Methods: Our study population was HIV-uninfected adult members of Kaiser Permanente Northern California, a large integrated health-care system, who were not yet using PrEP and had at least 2 years of previous health plan enrolment with at least one outpatient visit from Jan 1, 2007, to Dec 31, 2017. Using 81 electronic health record (EHR) variables, we applied least absolute shrinkage and selection operator (LASSO) regression to predict incident HIV diagnosis within 3 years on a subset of patients who entered the cohort in 2007-14 (development dataset), assessing ten-fold cross-validated area under the receiver operating characteristic curve (AUC) and 95% CIs. We compared the full model to simpler models including only men who have sex with men (MSM) status and sexually transmitted infection (STI) positivity, testing, and treatment. Models were validated prospectively with data from an independent set of patients who entered the cohort in 2015-17. We computed predicted probabilities of incident HIV diagnosis within 3 years (risk scores), categorised as low risk (<0·05%), moderate risk (0·05% to <0·20%), high risk (0·20% to <1·0%), and very high risk (≥1·0%), for all patients in the validation dataset.
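
The risk-score categories described in the Methods can be illustrated with a minimal Python sketch. It assumes the model's output is a probability between 0 and 1, so the reported cut-points of 0·05%, 0·20%, and 1·0% become 0.0005, 0.0020, and 0.010. The names `risk_category` and `is_flagged` are hypothetical and do not come from the study; the sketch shows only the binning step, not the LASSO model itself.

```python
from bisect import bisect_right

# Cut-points from the abstract, expressed as probabilities (assumption:
# risk scores are probabilities in [0, 1], so 0.05% -> 0.0005, etc.).
CUTPOINTS = [0.0005, 0.0020, 0.010]
LABELS = ["low", "moderate", "high", "very high"]


def risk_category(p: float) -> str:
    """Map a predicted 3-year probability of HIV diagnosis to a category."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # bisect_right makes each lower bound inclusive, matching the abstract:
    # 0.05% to <0.20% is moderate, 0.20% to <1.0% is high, >=1.0% is very high.
    return LABELS[bisect_right(CUTPOINTS, p)]


def is_flagged(p: float) -> bool:
    """Hypothetical helper: True for the high or very high categories."""
    return risk_category(p) in ("high", "very high")


if __name__ == "__main__":
    for score in (0.0001, 0.0005, 0.0019, 0.0020, 0.0100):
        print(f"{score:.4f} -> {risk_category(score)}")
```

Using sorted cut-points with `bisect_right` keeps the boundaries in one place, so a different threshold scheme would only require changing the two lists.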

Findings: Of 3 750 664 patients in 2007-17 (3 143 963 in the development dataset and 606 701 in the validation dataset), there were 784 incident HIV cases within 3 years of baseline. The LASSO procedure retained 44 predictors in the full model, with an AUC of 0·86 (95% CI 0·85-0·87) for incident HIV cases in 2007-14. Model performance remained high in the validation dataset (AUC 0·84, 0·80-0·89). The full model outperformed simpler models including only MSM status and STI positivity. For the full model, flagging 13 463 (2·2%) patients with high or very high HIV risk scores in the validation dataset identified 32 (38·6%) of the 83 incident HIV cases, including 32 (46·4%) of 69 male cases and none of the 14 female cases. The full model had equivalent sensitivity by race whereas simpler models identified fewer black than white HIV cases.
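
The percentages reported in the Findings follow directly from the stated counts, as the short Python check below shows. All counts are taken from the abstract; the variable names and the `pct` helper are illustrative assumptions, not part of the study.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts as reported in the abstract.
development_n = 3_143_963
validation_n = 606_701
flagged_n = 13_463        # high or very high risk scores, validation dataset
cases_total = 83          # incident HIV cases, validation dataset
cases_flagged = 32        # incident cases among flagged patients
male_cases = 69
female_cases = 14

# The two datasets sum to the reported cohort size.
total_n = development_n + validation_n

share_flagged = pct(flagged_n, validation_n)         # proportion flagged
sensitivity_all = pct(cases_flagged, cases_total)    # sensitivity, all cases
sensitivity_male = pct(cases_flagged, male_cases)    # all flagged cases were male
sensitivity_female = pct(0, female_cases)            # no female cases flagged

if __name__ == "__main__":
    print(total_n)             # 3750664
    print(share_flagged)       # 2.2
    print(sensitivity_all)     # 38.6
    print(sensitivity_male)    # 46.4
    print(sensitivity_female)  # 0.0
```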

Interpretation: Prediction models using EHR data can identify patients at high risk of HIV acquisition who could benefit from PrEP. Future studies should optimise EHR-based HIV risk prediction tools and evaluate their effect on prescription of PrEP.

Funding: Kaiser Permanente Community Benefit Research Program and the US National Institutes of Health.

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Anti-HIV Agents / therapeutic use
  • Cohort Studies
  • Electronic Health Records
  • Female
  • HIV Infections / prevention & control*
  • Homosexuality, Male
  • Humans
  • Machine Learning
  • Male
  • Middle Aged
  • Pre-Exposure Prophylaxis / methods*
  • Young Adult

Substances

  • Anti-HIV Agents