COVID-19 from symptoms to prediction: A statistical and machine learning approach

Comput Biol Med. 2024 Nov:182:109211. doi: 10.1016/j.compbiomed.2024.109211. Epub 2024 Sep 28.

Abstract

During the COVID-19 pandemic, the analysis of patient data became a cornerstone for developing effective public health strategies. This study leverages a dataset comprising over 10,000 anonymized patient records from various leading medical institutions to predict COVID-19 patient age groups using a suite of statistical and machine learning techniques. Initially, extensive statistical tests, including ANOVA and t-tests, were used to assess relationships among demographic and symptomatic variables. The study then employed machine learning models such as Decision Tree, Naïve Bayes, K-Nearest Neighbors (KNN), Gradient Boosted Trees, Support Vector Machine, and Random Forest, with rigorous data preprocessing to enhance model accuracy. Further improvements were sought through ensemble methods: bagging, boosting, and stacking. Our findings indicate strong associations between key symptoms and patient age groups, with ensemble methods significantly enhancing model accuracy. Specifically, stacking with random forest as the meta-learner achieved the highest accuracy (0.7054). In addition, stacking notably improved the performance of KNN (from 0.529 to 0.63) and Naïve Bayes (from 0.554 to 0.622) and proved to be the most successful prediction method. The study aimed to understand the number of symptoms identified in COVID-19 patients and their association with different age groups. The results can assist doctors and higher authorities in improving treatment strategies. Additionally, decisions made during a pandemic, such as resource allocation, medicine availability, vaccine development, and treatment strategies, can be tailored to specific age groups. The integration of these predictive models into clinical settings could support real-time public health responses and targeted intervention strategies.
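To illustrate the kind of statistical test mentioned in the abstract, the following is a minimal sketch of a one-way ANOVA F-statistic comparing symptom counts across age groups. It is not the authors' code: the function name `one_way_anova_f`, the choice of symptom count as the tested variable, and the sample values are all hypothetical, and the abstract does not say which variables were actually tested.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of numeric observations
    per group (here: symptom counts per hypothetical age group).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up symptom counts for three hypothetical age groups
# (illustrative only, not data from the study).
sample = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(sample)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value relative to the F distribution with the reported degrees of freedom suggests that mean symptom counts differ between age groups. Obtaining a p-value would require the F distribution's tail probability, which a statistics library can supply.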

Keywords: COVID-19; Ensemble algorithms; Machine learning; Predictive models; Public health informatics; Statistical analysis.

MeSH terms

  • Adolescent
  • Adult
  • Aged
  • Aged, 80 and over
  • Bayes Theorem
  • COVID-19* / epidemiology
  • Child
  • Child, Preschool
  • Female
  • Humans
  • Machine Learning*
  • Male
  • Middle Aged
  • Models, Statistical
  • Pandemics
  • SARS-CoV-2*