Cost-Constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms

Rudolf Jagdhuber; Michel Lang; Arnulf Stenzl; Jochen Neuhaus; Jörg Rahnenführer

doi:10.1186/s12859-020-3361-9

BMC Bioinformatics. 2020 Jan 28;21(1):26. doi: 10.1186/s12859-020-3361-9.

Authors

Rudolf Jagdhuber¹ ², Michel Lang¹, Arnulf Stenzl³, Jochen Neuhaus⁴, Jörg Rahnenführer⁵

Affiliations

¹ Department of Statistics, TU Dortmund, Vogelpothsweg 87, Dortmund, 44227, Germany.
² numares AG, Am BioPark 9, Regensburg, 93053, Germany.
³ Klinik für Urologie, Universitätsklinikum Tübingen, Hoppe-Seyler-Str. 3, Tübingen, 72076, Germany.
⁴ Universitätsklinikum Leipzig AöR, Department für Operative Medizin, Klinik und Poliklinik für Urologie, Liebigstr. 20, Leipzig, 04103, Germany.
⁵ Department of Statistics, TU Dortmund, Vogelpothsweg 87, Dortmund, 44227, Germany. rahnenfuehrer@statistik.tu-dortmund.de.

Abstract

Background: With modern methods in biotechnology, the search for biomarkers has become a challenging statistical task that involves exploring high-dimensional data sets. Feature selection is a widely researched preprocessing step for handling huge numbers of biomarker candidates and is of special importance for the analysis of biomedical data. Such data sets often include many input features that are not related to the diagnostic or therapeutic target variable. A less researched but equally relevant aspect for medical applications is the cost of different biomarker candidates. These costs are often financial, but they can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods, greedy forward selection and genetic algorithms, to control the total amount of such costs. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the newly proposed methods and five baseline alternatives for handling budget constraints.

Results: In simulations with a predefined budget constraint, our proposed methods outperform the baseline alternatives, with only minor differences between them. Only in the scenario without an actual budget constraint did our adapted greedy forward selection approach show a clear drop in performance compared to the other methods. However, introducing a hyperparameter to adapt the benefit-cost trade-off in this method could overcome this weakness.

Conclusions: In feature cost scenarios where a total budget has to be met, common feature selection algorithms are often not suitable for identifying well-performing subsets for a modelling task. Adaptations of these algorithms, such as the ones proposed in this paper, can help to tackle this problem.

Keywords: Budget constraint; Cost limit; Feature cost; Feature selection; Genetic algorithm.
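
Illustrative sketch: The abstract does not spell out the adapted algorithms, so the following is only a minimal sketch of how a budget-aware greedy forward selection with a benefit-cost trade-off might look, not the authors' implementation. The function name, the `score` callback, the exponent `xi`, and the toy costs and scores are all assumptions made for illustration. Each step adds the affordable feature with the largest score gain divided by its cost raised to the power `xi`; with `xi = 0` the cost only acts as a budget constraint, and with `xi = 1` the plain benefit-cost ratio is used.

```python
from typing import Callable, Dict, FrozenSet, List


def cost_constrained_forward_selection(
    costs: Dict[str, float],
    score: Callable[[FrozenSet[str]], float],
    budget: float,
    xi: float = 1.0,
) -> List[str]:
    """Greedy forward selection under a total cost budget (illustrative).

    costs  : strictly positive cost per candidate feature (assumed given)
    score  : hypothetical callback returning model performance for a subset
    budget : maximum total cost of the selected features
    xi     : assumed trade-off exponent; 0 ignores costs when ranking
    """
    selected: List[str] = []
    spent = 0.0
    current = score(frozenset())
    while True:
        best_feature, best_ratio, best_score = None, 0.0, current
        for feature, cost in costs.items():
            # Skip features already chosen or no longer affordable.
            if feature in selected or spent + cost > budget:
                continue
            new_score = score(frozenset(selected + [feature]))
            gain = new_score - current
            ratio = gain / (cost ** xi)
            if gain > 0 and ratio > best_ratio:
                best_feature, best_ratio, best_score = feature, ratio, new_score
        if best_feature is None:
            # No affordable feature improves the score any further.
            return selected
        selected.append(best_feature)
        spent += costs[best_feature]
        current = best_score


# Toy example with made-up costs and additive score contributions.
toy_costs = {"biopsy": 8.0, "urine": 1.0, "blood": 3.0}
toy_values = {"biopsy": 0.20, "urine": 0.05, "blood": 0.12}


def toy_score(subset: FrozenSet[str]) -> float:
    return 0.5 + sum(toy_values[f] for f in subset)


ratio_based = cost_constrained_forward_selection(toy_costs, toy_score, budget=10.0, xi=1.0)
gain_based = cost_constrained_forward_selection(toy_costs, toy_score, budget=10.0, xi=0.0)
print(ratio_based, gain_based)
```

In this toy setting, the ratio-based variant favours the cheap markers, while the variant with `xi = 0` picks the expensive marker first and then fills the remaining budget. A tunable exponent of this kind is one plausible way to realise the benefit-cost hyperparameter mentioned in the Results.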
