Predicting protein function and other biomedical characteristics with heterogeneous ensembles

Methods. 2016 Jan 15;93:92-102. doi: 10.1016/j.ymeth.2015.08.016. Epub 2015 Sep 2.

Abstract

Prediction problems in the biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, uncertainty about the appropriateness and quality of the variables and measurements used for prediction, and a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors, which combine the outputs of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of our results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) a better balance of diversity and performance, (ii) more effective calibration of outputs, and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large, complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate that it scales well on the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
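
As a rough illustration of the ensemble selection idea mentioned in the abstract, the sketch below greedily adds base predictors (with replacement) to an averaged ensemble for as long as a validation AUC improves. It is a minimal, standard-library-only sketch and not DataSink's API: the function names (auc, ensemble_selection, ensemble_predict), the toy predictors, and the use of a single validation set instead of nested cross-validation are all assumptions made for illustration.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_selection(val_preds, val_labels, max_rounds=10):
    """Greedy forward selection with replacement over base predictors.

    val_preds maps a (hypothetical) predictor name to its scores on a
    validation set. A predictor is added only if it strictly improves the
    AUC of the averaged ensemble; the list of chosen names is returned.
    """
    selected = []
    running_sum = [0.0] * len(val_labels)
    best_score = float("-inf")
    for _ in range(max_rounds):
        best_name, best_candidate = None, best_score
        k = len(selected) + 1
        for name, preds in val_preds.items():
            averaged = [(s + p) / k for s, p in zip(running_sum, preds)]
            score = auc(val_labels, averaged)
            if score > best_candidate:
                best_name, best_candidate = name, score
        if best_name is None:  # no candidate improves the ensemble; stop
            break
        selected.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_name])]
        best_score = best_candidate
    return selected


def ensemble_predict(selected, preds):
    """Average the scores of the selected predictors (repeats act as weights)."""
    n = len(next(iter(preds.values())))
    return [sum(preds[name][i] for name in selected) / len(selected) for i in range(n)]


# Toy data (an assumption): three noisy "base predictors" of a binary label.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
val_preds = {
    "predictor_a": [y + rng.gauss(0, 0.8) for y in labels],
    "predictor_b": [y + rng.gauss(0, 1.0) for y in labels],
    "predictor_c": [y + rng.gauss(0, 1.5) for y in labels],
}
selected = ensemble_selection(val_preds, labels)
print("selected:", selected)
print("ensemble AUC:", auc(labels, ensemble_predict(selected, val_preds)))
```

Because a predictor is added only when it strictly improves the validation score, the selected ensemble can never score below the best single predictor on that validation set. Performance on held-out data would still need to be estimated separately, for example with the nested cross-validation listed among the keywords.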

Keywords: Distributed machine learning; Diversity-performance tradeoff; Ensemble calibration; Heterogeneous ensembles; Nested cross-validation; Protein function prediction.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Databases, Protein*
  • Forecasting
  • Machine Learning*
  • Proteins / physiology*

Substances

  • Proteins