Background: Because it is often diagnosed at a late stage, lung cancer is the commonest cause of death from cancer in the UK. Existing epidemiological risk models in clinical use, which have Positive Predictive Values (PPV) of less than 10%, do not consider the temporal relations expressed in sequential electronic health record (EHR) data. We aimed to build a model for early detection of lung cancer in primary care, using deep 'transformer' machine learning models on EHR data to learn from these complex sequential 'care pathways'.
Methods: We split the Whole Systems Integrated Care (WSIC) dataset into 70% training and 30% validation. Within the training set we created a case-control study with lung cancer cases and controls with 'other' cancers, respiratory conditions, or 'other' non-cancer conditions. Among 3,303,992 patients from January 1981 to December 2020, there were 11,847 lung cancer cases. We used 5,789 cases and 7,240 controls for training, and 50,000 patients randomly selected from the whole validation population of 368,906 for validation. GP EHR data from the three years before the date of diagnosis, excluding the most recent month, were semantically pre-processed by mapping more than 30,000 terms to 450. Model building was performed using ALBERT with a Logistic Regression Classifier (LRC) head. Clustering was explored using k-means. A regression model alone was also built on the pre-processed data as a comparator.
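The windowing and term-mapping step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the mapping table, the term names, and the day counts used to approximate "three years" and "one month" are all hypothetical assumptions, since the abstract does not specify them.

```python
from datetime import date, timedelta

# Hypothetical mapping from raw EHR terms to grouped concepts.
# The real mapping (more than 30,000 terms to 450) is not given in the abstract.
TERM_MAP = {
    "cough_persistent": "COUGH",
    "cough_chronic": "COUGH",
    "chest_xray_requested": "CHEST_IMAGING",
    "ct_thorax": "CHEST_IMAGING",
}


def build_sequence(events, index_date, lookback_days=3 * 365, exclusion_days=30):
    """Return one patient's time-ordered list of grouped concepts.

    events: iterable of (event_date, raw_term) pairs.
    Keeps events dated between index_date - lookback_days and
    index_date - exclusion_days (both inclusive); the day counts are
    assumed approximations of "three years" and "one month".
    Terms missing from TERM_MAP are dropped.
    """
    start = index_date - timedelta(days=lookback_days)
    end = index_date - timedelta(days=exclusion_days)
    kept = [
        (event_date, TERM_MAP[term])
        for event_date, term in events
        if start <= event_date <= end and term in TERM_MAP
    ]
    kept.sort(key=lambda pair: pair[0])  # oldest event first
    return [concept for _, concept in kept]
```

A sequence built this way would then be tokenised and passed to the sequence model; that stage is outside the scope of this sketch.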
Findings: Our model achieved an AUROC of 0.924 (95% CI 0.921-0.927), with a PPV of 3.6% (95% CI 3.5-3.7) and sensitivity of 86.6% (95% CI 85.3-87.8), based on the three years of data prior to diagnosis, excluding the month immediately before the index diagnosis. The comparator regression model achieved a PPV of 3.1% (95% CI 3.0-3.1) and an AUROC of 0.887 (95% CI 0.884-0.889). We interpreted our model using cluster analysis and identified six groups of patients exhibiting similar lung cancer progression patterns and clinical investigation patterns.
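A high sensitivity can coexist with a single-digit PPV when the condition is rare in the screened population, because PPV depends on prevalence as well as on sensitivity and specificity. The sketch below shows this relationship through Bayes' rule. The function name and the numbers in the usage note are illustrative assumptions; the abstract does not report the specificity or the prevalence in the validation sample.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence.

    All three arguments are proportions in [0, 1]. Illustrative helper only;
    it is not taken from the study.
    """
    true_pos = sensitivity * prevalence                # P(test+, disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(test+, no disease)
    return true_pos / (true_pos + false_pos)
```

For example, with an assumed sensitivity and specificity of 0.9 each, `ppv_from_rates(0.9, 0.9, 0.5)` is 0.9, while `ppv_from_rates(0.9, 0.9, 0.01)` falls to roughly 0.083, since false positives from the much larger disease-free group dominate.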
Interpretation: Capturing the temporal sequencing of events in cancer and non-cancer pathways to diagnosis enables more accurate models than the regression comparator. Future work will focus on validation on external datasets and integration into GP clinical systems for evaluation.
Funding: Cancer Research UK.
Keywords: Artificial intelligence; Cancer prediction; Deep learning; Machine learning; Primary care; Transformers.
Copyright © 2024 The Authors. Published by Elsevier B.V. All rights reserved.