Prediction of pre-eclampsia with machine learning approaches: Leveraging important information from routinely collected data

Int J Med Inform. 2024 Dec:192:105645. doi: 10.1016/j.ijmedinf.2024.105645. Epub 2024 Oct 5.

Abstract

Background: Globally, pre-eclampsia (PE) is a leading cause of maternal and perinatal morbidity and mortality. PE prediction using routinely collected data has the advantage of being widely applicable, particularly in low-resource settings. Early intervention for high-risk women might reduce PE incidence and related complications. Replicating the approach of our published machine learning (ML) work on predicting another maternal condition (gestational diabetes), we aimed to (1) predict PE using routine health data, (2) identify the optimal ML model, and (3) compare it with a logistic regression approach.

Methods: Data were from a large health service network with 48,250 singleton pregnancies between January 2016 and June 2021. Supervised ML models were employed. Maternal clinical and medical characteristics were the feature variables (predictors), and a 70/30 data split was used for training and testing the models. Predictive performance was assessed using area under the curve (AUC) and calibration plots. Shapley value analysis assessed the contribution of feature variables.
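As an illustration of two of the evaluation steps named in the Methods, the sketch below shows a 70/30 split and a rank-based AUC calculation using only the Python standard library. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `split_70_30` and `auc`, the random seed, and the toy inputs are hypothetical, and the abstract does not state which software was used or whether the split was stratified.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle a copy of `records` and return (train, test) in a 70/30 ratio.

    Assumption: a simple random split; the abstract does not say whether
    the split was stratified by outcome.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    train, test = split_70_30(range(10))
    print(len(train), len(test))
    # Toy labels and predicted risks, invented for illustration only.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of cases, so it suits small examples; for a cohort of this size a sort-based implementation or an established library routine would be the practical choice.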

Results: The random forest approach provided excellent discrimination with an AUC of 0.84 (95% CI: 0.82-0.86) and the highest prediction accuracy (0.79); however, the calibration curve (slope of 1.21, 95% CI: 1.13-1.30) was acceptable only for a threshold of 0.3 or less. The next best approach was extreme gradient boosting, which provided an AUC of 0.77 (95% CI: 0.76-0.79) and was well calibrated (slope of 0.93, 95% CI: 0.85-1.01). Logistic regression provided good discrimination performance with an AUC of 0.75 (95% CI: 0.74-0.76) and perfect calibration. Nulliparity, pre-pregnancy body mass index, PE in a previous pregnancy, maternal age, family history of hypertension, and pre-existing hypertension and diabetes were the top-ranked features in Shapley value analysis.
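The calibration slopes reported above can be read as follows: a slope of 1 means predicted risks match observed risks on the logit scale, a slope below 1 suggests predictions that are too extreme, and a slope above 1 suggests predictions that are too moderate. One common way to estimate the slope is to fit a logistic regression of the observed outcome on the logit of the predicted probability. The sketch below does this with Newton-Raphson in plain Python. It is an illustrative assumption about the calculation, not a description of the authors' code; the function name `calibration_slope` and the toy data are hypothetical.

```python
import math


def _logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_slope(y_true, probs, iterations=25):
    """Return (intercept, slope) from regressing outcomes on logit(probs).

    Fits logit(P(y = 1)) = a + b * logit(p) by Newton-Raphson.
    Assumes every predicted probability lies strictly between 0 and 1.
    """
    x = [_logit(p) for p in probs]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            r = yi - mu
            # Gradient of the log-likelihood
            g0 += r
            g1 += r * xi
            # Entries of the (positive) information matrix X'WX
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("predictions must take at least two distinct values")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-12:
            break
    return a, b


if __name__ == "__main__":
    # Toy data, invented for illustration: three risk groups of 10 women
    # whose observed event rates equal the predicted risks exactly.
    outcomes = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
    calibrated = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
    print(calibration_slope(outcomes, calibrated))
    # The same outcomes with predictions twice as extreme on the logit scale.
    extreme = [1 / 17] * 10 + [0.5] * 10 + [16 / 17] * 10
    print(calibration_slope(outcomes, extreme))
```

In the toy data the first call yields a slope close to 1 and the second a slope close to 0.5, because the second set of predictions doubles every log-odds value while the outcomes stay the same.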

Conclusion: Two ML models achieved the highest predictive performance using routinely collected data to identify women at high risk of PE, with acceptable discrimination. However, external validation studies using standardised prognostic factors are needed in other settings to confirm this result and to examine model generalisability.

Keywords: Computer-aided medical decision support; Gradient Boosting; Learning health system; Machine-learning; Pre-eclampsia; Prediction; Prognostic; Random forest; Routine health data; XGBoost.

MeSH terms

  • Adult
  • Area Under Curve
  • Diabetes, Gestational / diagnosis
  • Diabetes, Gestational / epidemiology
  • Female
  • Humans
  • Logistic Models
  • Machine Learning*
  • Pre-Eclampsia* / diagnosis
  • Pre-Eclampsia* / epidemiology
  • Pregnancy
  • Risk Factors