Background: Prediction of hepatocellular carcinoma (HCC) development in persons with known risk factors remains a challenge and is an urgent unmet need, considering projected increases in HCC incidence and mortality in the US. We aimed to use machine learning techniques to identify a set of demographic, lifestyle, and health history variables that can be used together for population-level HCC risk prediction.
Methods: Data from 377,065 participants of the NIH-AARP Diet and Health Study, among whom 647 developed HCC over 16 years of follow-up, were analyzed. The sample was randomly divided into independent training (60%) and validation (40%) sets. We evaluated 123 participant characteristics and tested 15 different machine learning algorithms for robustness in predicting HCC risk. Separately, we evaluated variables selected from multivariable logistic regression for risk prediction.
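With 647 cases among 377,065 participants, HCC accounts for roughly 0.17% of the cohort, so the two classes are severely imbalanced. The following is a minimal sketch, not the authors' code, of a random 60/40 split followed by random under-sampling of controls in the training portion. The function names, the fixed seed, the toy labels, and the 1:1 case-to-control ratio are all assumptions made for illustration; the abstract does not state whether the split was stratified or how resampling was configured.

```python
import random


def split_train_validation(ids, train_fraction=0.6, seed=0):
    """Randomly split ids into training and validation lists (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def undersample_majority(labels, seed=0):
    """Return positions of all cases plus an equal number of randomly drawn controls.

    The 1:1 ratio is an assumption for illustration only.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    sampled = rng.sample(controls, min(len(cases), len(controls)))
    return sorted(cases + sampled)


# Toy cohort (not study data): 20 cases and 980 controls.
labels = [1] * 20 + [0] * 980

train_idx, valid_idx = split_train_validation(range(len(labels)))
train_labels = [labels[i] for i in train_idx]

# Balance only the training portion; the validation set keeps its natural prevalence.
balanced = undersample_majority(train_labels)
balanced_labels = [train_labels[i] for i in balanced]
```

Under-sampling is applied only to the training portion here so that validation metrics reflect the original case prevalence.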
Results: The random under-sampling boosting (RUSBoost) algorithm performed best during model testing. Fourteen participant characteristics were selected for risk prediction based on differences between cases and controls (Bonferroni-corrected p-values <0.0004) and on the variables used most frequently in the first two decision trees of the RUSBoost learner. A predictive model based on the 14 variables had an AUC of 0.72 (sensitivity=0.68, specificity=0.63) and independent validation AUC of 0.65 (sensitivity=0.68, specificity=0.63). A subset of 9 variables identified through logistic regression also had an AUC of 0.72 (sensitivity=0.67, specificity=0.63) and independent validation AUC of 0.65 (sensitivity=0.70, specificity=0.61).
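The quantities reported above can be illustrated with a short sketch. It assumes a family-wise alpha of 0.05 spread over the 123 evaluated characteristics, which gives a per-test threshold of about 0.0004 and would be consistent with the reported cutoff; the abstract does not state the alpha used. Sensitivity, specificity, and AUC are computed on made-up toy scores, not study data, and the helper names and the 0.5 decision threshold are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Assumed alpha of 0.05 across 123 candidate characteristics.
threshold = bonferroni_threshold(0.05, 123)

# Toy example: 3 cases, 4 controls, with illustrative risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical cutoff of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

The pairwise AUC formulation is quadratic in sample size and is used here only for clarity; a rank-based computation would be preferable for a cohort of this size.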
Conclusion: Population-level HCC risk prediction can be performed with a machine learning-based algorithm and could inform strategies for reducing HCC risk in at-risk groups.
Keywords: HCC; hepatocellular carcinoma; liver cancer; machine learning; risk prediction.
© 2022 Thomas et al.