The volume of biomedical data available to the machine learning community grows very rapidly. A rational question is how informative these data really are or how discriminant the features describing the data instances are. Several biomedical datasets suffer from lack of variance in the instance representation, or even worse, contain instances with identical features and different class labels. Indisputably, this directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we emphasize on the aforementioned problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing predictive performance of a learner applied to the induced features. The contribution of this article is twofold. Firstly, a problem affecting the quality of biomedical data is highlighted, and secondly, a method to handle that problem is proposed. The efficiency of the presented approach is validated on multi-target prediction tasks. The obtained results indicate that the proposed approach is able to boost the discrimination between the data instances and increase the predictive performance.
Keywords: Biomedical data mining; Extremely randomized trees; Tree-embeddings; Tree-ensembles.
Copyright © 2018 Elsevier Inc. All rights reserved.