Identifying representative trees from ensembles

Stat Med. 2012 Jul 10;31(15):1601-16. doi: 10.1002/sim.4492. Epub 2012 Feb 3.

Abstract

Tree-based methods have become popular for analyzing complex data structures where the primary goal is risk stratification of patients. Ensemble techniques improve predictive accuracy and address the instability of a single tree by growing an ensemble of trees and aggregating their predictions. However, in the process, individual trees get lost. In this paper, we propose a methodology for identifying the most representative trees in an ensemble on the basis of several tree distance metrics. Although our focus is on binary outcomes, the methods are applicable to censored data as well. For any two trees, the distance metrics are chosen to (1) measure similarity of the covariates used to split the trees; (2) reflect similar clustering of patients in the terminal nodes of the trees; and (3) measure similarity in predictions from the two trees. Whereas the third metric focuses on prediction, the first two focus on the architectural similarity between two trees. The most representative trees in the ensemble are chosen on the basis of the average distance between a tree and all other trees in the ensemble. An out-of-bag estimate of the error rate is obtained using neighborhoods of representative trees. Simulations and data examples show gains in predictive accuracy when averaging over such neighborhoods. We illustrate our methods using a dataset of kidney cancer treatment receipt (binary outcome) and a second dataset of breast cancer survival (censored outcome).
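
The selection rule described in the abstract (pick the tree with the smallest average distance to all other trees) can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas for the three metrics, so the definitions below are illustrative stand-ins and should be read as assumptions: `covariate_distance` counts covariates used by only one of two trees, `clustering_distance` counts subject pairs that one tree groups into the same terminal node and the other separates, and `prediction_distance` is a mean squared difference in predictions. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def covariate_distance(used_a, used_b, n_covariates):
    """Assumed metric (1): share of covariates split on by exactly one tree."""
    return len(set(used_a) ^ set(used_b)) / n_covariates


def clustering_distance(leaves_a, leaves_b):
    """Assumed metric (2): share of subject pairs that one tree places in the
    same terminal node while the other tree separates them."""
    pairs = list(combinations(range(len(leaves_a)), 2))
    disagreements = sum(
        (leaves_a[i] == leaves_a[j]) != (leaves_b[i] == leaves_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def prediction_distance(pred_a, pred_b):
    """Assumed metric (3): mean squared difference between predictions."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


def most_representative(items, distance):
    """Return the index of the item with the smallest average distance to
    all other items, together with the list of average distances."""
    n = len(items)
    avg = [
        sum(distance(items[i], items[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = min(range(n), key=avg.__getitem__)
    return best, avg


# Toy data: predicted probabilities from four hypothetical trees
# for five subjects (not taken from the paper).
toy_predictions = [
    [0.1, 0.2, 0.8, 0.9, 0.5],
    [0.2, 0.2, 0.7, 0.9, 0.5],
    [0.1, 0.3, 0.8, 0.8, 0.6],
    [0.9, 0.8, 0.1, 0.2, 0.5],
]
best_index, avg_distances = most_representative(toy_predictions, prediction_distance)
print(best_index, avg_distances)
```

Passing the distance function as an argument keeps the selection step independent of the metric, so the same `most_representative` call could be reused with terminal-node labels and `clustering_distance`, or with any other pairwise tree distance.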

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Breast Neoplasms / epidemiology
  • Breast Neoplasms / pathology
  • Breast Neoplasms / therapy
  • Computer Simulation
  • Data Interpretation, Statistical
  • Female
  • Forecasting / methods
  • Humans
  • Kidney Neoplasms / epidemiology
  • Kidney Neoplasms / therapy
  • Male
  • Models, Statistical*
  • Outcome and Process Assessment, Health Care / methods
  • Outcome and Process Assessment, Health Care / statistics & numerical data*
  • Risk Assessment / methods
  • Risk Assessment / statistics & numerical data*
  • Survival Analysis*