Background: Type 2 diabetes disproportionately affects individuals of non-White ethnicity through a complex interaction of multiple factors. Therefore, early disease detection and prediction are essential and require tools that can be deployed on a large scale. We aimed to tackle this problem by developing questionnaire-based prediction models for type 2 diabetes prevalence and incidence for multiple ethnicities.
Methods: In this proof-of-principle analysis, logistic regression models to predict type 2 diabetes prevalence and incidence, using questionnaire-only variables reflecting health state and lifestyle, were trained on the White population of the UK Biobank (n = 472,696 total, aged 37-73 years, data collected 2006-2010) and validated in five other ethnicities (n = 29,811 total) and externally in Lifelines (n = 168,205 total, aged 0-93 years, collected between 2006 and 2013). In total, 631,748 individuals were included for prevalence prediction and 67,083 individuals for the eight-year incidence prediction. Type 2 diabetes prevalence in the UK Biobank ranged from 6% in the White population to 23.3% in the South Asian population, while in Lifelines, the prevalence was 1.9%. Predictive accuracy was evaluated using the area under the receiver operating characteristic curve (AUC), and a detailed sensitivity analysis was conducted to assess potential clinical utility. We compared the questionnaire-only models to models containing physical measurements and biomarkers as well as to clinical non-laboratory type 2 diabetes risk tools and conducted a reclassification analysis.
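For readers unfamiliar with the AUC metric named in the Methods, the following is a minimal sketch of how it can be computed from predicted risks, overall and separately per validation group. It is not the study's code: the function names `auc` and `auc_by_group`, the group labels, and all values are hypothetical toy data chosen for illustration only.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case (label 1) receives a
    higher score than a randomly chosen non-case (label 0); ties count half.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def auc_by_group(labels, scores, groups):
    """Compute the AUC separately for each group label."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}


# Hypothetical toy data: 1 = has type 2 diabetes, scores = predicted risk.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]
groups = ["A", "A", "A", "B", "B", "B"]  # illustrative group labels only

overall_auc = auc(labels, scores)
group_aucs = auc_by_group(labels, scores, groups)
```

The pairwise form is quadratic in sample size and is shown only because it makes the definition explicit; at cohort scale a rank-based implementation from a statistics library would be the practical choice.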
Findings: Our algorithms accurately predicted type 2 diabetes prevalence (AUC = 0.901) and eight-year incidence (AUC = 0.873) in the White UK Biobank population. Both models replicated well in the Lifelines external validation, with AUCs of 0.917 and 0.817 for prevalence and incidence, respectively. Both models performed consistently well across different ethnicities, with AUCs of 0.855-0.894 for prevalence and 0.819-0.883 for incidence. These models generally outperformed two clinically validated non-laboratory tools and correctly reclassified >3,000 additional cases. Model performance improved with the addition of blood biomarkers but not with the addition of physical measurements.
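The abstract does not state how the reclassification analysis was carried out. As one common approach, the sketch below computes a categorical net reclassification improvement (NRI) at a single risk threshold. The function name `reclassification`, the threshold of 0.5, and all risk values are assumptions made for illustration and are not taken from the study.

```python
def reclassification(labels, old_risk, new_risk, threshold):
    """Categorical reclassification of a new model against an old one.

    An individual is 'high risk' when predicted risk >= threshold. For cases,
    moving up to high risk is a correct reclassification; for non-cases,
    moving down to low risk is correct.
    """
    counts = {"cases_up": 0, "cases_down": 0,
              "noncases_up": 0, "noncases_down": 0}
    n_cases = n_noncases = 0
    for y, old, new in zip(labels, old_risk, new_risk):
        old_high = old >= threshold
        new_high = new >= threshold
        prefix = "cases" if y == 1 else "noncases"
        if y == 1:
            n_cases += 1
        else:
            n_noncases += 1
        if new_high and not old_high:
            counts[prefix + "_up"] += 1
        elif old_high and not new_high:
            counts[prefix + "_down"] += 1
    if n_cases == 0 or n_noncases == 0:
        raise ValueError("need at least one case and one non-case")
    # Net improvement among cases plus net improvement among non-cases.
    nri = ((counts["cases_up"] - counts["cases_down"]) / n_cases
           + (counts["noncases_down"] - counts["noncases_up"]) / n_noncases)
    return counts, nri


# Hypothetical toy data: old_risk from a comparison tool, new_risk from a
# questionnaire-based model; threshold chosen arbitrarily.
labels = [1, 1, 1, 0, 0, 0]
old_risk = [0.30, 0.60, 0.10, 0.55, 0.20, 0.40]
new_risk = [0.65, 0.70, 0.20, 0.35, 0.15, 0.60]

counts, nri = reclassification(labels, old_risk, new_risk, threshold=0.5)
```

Here `counts["cases_up"]` corresponds to the kind of figure the Findings report as additional cases correctly reclassified, under the assumption that a single threshold defines the high-risk category.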
Interpretation: Our findings suggest that easy-to-implement, questionnaire-based models could be used to predict prevalent and incident type 2 diabetes with high accuracy across several ethnicities, providing a highly scalable solution for population-wide risk stratification. Future work should determine how effectively these models identify undiagnosed type 2 diabetes and should validate them in cohorts with different populations and ethnic representation.
Funding: University Medical Center Groningen.
Keywords: Incidence; Machine learning; Population screening; Prediction; Prevalence; Type 2 diabetes.
© 2023 The Author(s).