A pancreatic cancer risk prediction model (Prism) developed and validated on large-scale US clinical data

EBioMedicine. 2023 Dec:98:104888. doi: 10.1016/j.ebiom.2023.104888. Epub 2023 Nov 25.

Abstract

Background: Pancreatic Ductal Adenocarcinoma (PDAC) screening can enable early-stage disease detection and long-term survival. Current screening guidelines rely on inherited predisposition, so only about 10% of PDAC cases are eligible for screening. Using Electronic Health Record (EHR) data from a multi-institutional federated network, we developed and validated a PDAC RISk Model (Prism) for the general US population to extend early PDAC detection.

Methods: Neural Network (PrismNN) and Logistic Regression (PrismLR) models were developed using EHR data from 55 US Health Care Organisations (HCOs) to predict PDAC risk 6-18 months before diagnosis for patients 40 years or older. Model performance was assessed using Area Under the Curve (AUC) and calibration plots. Models were internal-externally validated by geographic location, race, and time. A simulated model deployment was used to evaluate the Standardised Incidence Ratio (SIR) and other metrics.
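
To make the AUC metric named above concrete, the following is a minimal illustrative sketch, not the authors' code. It uses the standard interpretation of AUC as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The function name `auc_from_scores` and the toy scores are hypothetical.

```python
def auc_from_scores(scores, labels):
    """Return the AUC for binary labels (1 = case, 0 = control).

    Computed as the fraction of (case, control) pairs in which the case
    has the higher score; tied pairs contribute 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): two cases and two controls.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of patients and is meant only to show the definition; a rank-based implementation would be needed at the scale of the data described here.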

Findings: With 35,387 PDAC cases, 1,500,081 controls, and 87 features per patient, PrismNN obtained a test AUC of 0.826 (95% CI: 0.824-0.828), compared with 0.800 (95% CI: 0.798-0.802) for PrismLR. PrismNN's average internal-external validation AUCs were 0.740 for locations, 0.828 for races, and 0.789 (95% CI: 0.762-0.816) for time. At SIR = 5.10 (exceeding the current screening inclusion threshold) in simulated model deployment, PrismNN sensitivity was 35.9% (specificity 95.3%).
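
The relationship between a risk threshold, sensitivity, specificity, and SIR can be sketched as follows. This is an illustration under stated assumptions and not the study's deployment simulation: SIR is taken here in its usual sense, the number of cases observed in the flagged (high-risk) group divided by the number expected if that group had the reference population's incidence. The function `deployment_metrics`, the parameter `reference_incidence`, and the toy inputs are all hypothetical.

```python
def deployment_metrics(scores, labels, threshold, reference_incidence):
    """Return (sensitivity, specificity, sir) for patients flagged at
    score >= threshold.

    reference_incidence is an assumed per-person incidence proportion in
    the reference population over the same follow-up window.
    """
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    if tp + fp == 0 or reference_incidence <= 0:
        raise ValueError("no flagged patients or invalid reference incidence")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    expected_cases = (tp + fp) * reference_incidence  # expected under reference rate
    sir = tp / expected_cases                         # observed / expected
    return sensitivity, specificity, sir


# Toy data (hypothetical): 1 of 2 flagged patients is a case, and the
# assumed reference incidence is 0.25.
toy_sens, toy_spec, toy_sir = deployment_metrics(
    [0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0], threshold=0.5, reference_incidence=0.25
)
```

Raising the threshold generally increases specificity and SIR while lowering sensitivity, which is the trade-off behind reporting sensitivity and specificity at a fixed SIR. The SIR is only meaningful when the evaluated cohort reflects real-world prevalence rather than an enriched case-control sample.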

Interpretation: Prism models demonstrated good accuracy and generalizability across diverse populations. PrismNN could find 3.5 times more cases at comparable risk than current screening guidelines. The small number of features provided a basis for model interpretation. Integration with the federated network provided data from a large, heterogeneous patient population and a pathway to future clinical deployment.

Funding: Prevent Cancer Foundation, TriNetX, Boeing, DARPA, NSF, and Aarno Labs.

Keywords: Electronic health records; Federated data; Machine learning; Pancreatic cancer; Risk prediction.

MeSH terms

  • Carcinoma, Pancreatic Ductal* / pathology
  • Humans
  • Logistic Models
  • Multicenter Studies as Topic
  • Pancreatic Neoplasms* / diagnosis
  • Pancreatic Neoplasms* / epidemiology
  • Pancreatic Neoplasms* / etiology
  • Retrospective Studies