Clustering high throughput biological data with B-MST, a minimum spanning tree based heuristic

Comput Biol Med. 2015 Jul:62:94-102. doi: 10.1016/j.compbiomed.2015.03.031. Epub 2015 Apr 14.

Abstract

To address important challenges in bioinformatics, high throughput data technologies are needed to interpret biological data efficiently and reliably. Clustering is widely used as a first step in interpreting high-dimensional biological data, such as the gene expression data measured by microarrays. A good clustering algorithm should be efficient, reliable, and effective, as demonstrated by its ability to identify biologically relevant clusters. This paper proposes a new minimum spanning tree based heuristic, B-MST, that is guided by an innovative objective function: the tightness and separation index (TSI). The TSI presented here makes use of co-expression network topology to obtain biologically meaningful clusters, and this paper develops a local search procedure to minimize the TSI value. The proposed B-MST is evaluated using (1) the adjusted Rand index (ARI), for microarray data sets with known object classes, and (2) gene ontology (GO) annotations, for data sets without documented object classes.
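For readers unfamiliar with the first evaluation measure, the following is a minimal sketch of the standard adjusted Rand index, which compares a clustering against known object classes and corrects the pair-counting agreement for chance. It is not taken from the paper: the function name `adjusted_rand_index` and the toy labels are assumptions made for illustration, and the abstract does not describe how the authors computed the ARI in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard adjusted Rand index between two partitions of the same objects.

    Hypothetical helper for illustration only; not code from the B-MST paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same objects")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: the partitions trivially agree

    # Contingency counts: objects sharing a (true class, predicted cluster) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Numbers of object pairs placed together by each view of the data.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable agreement

    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: cluster ids differ from class ids, but the grouping is identical.
known_classes = [0, 0, 1, 1]
found_clusters = [1, 1, 0, 0]
score = adjusted_rand_index(known_classes, found_clusters)
```

A value of 1 indicates that the clustering reproduces the known classes exactly (up to relabeling), values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.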

Keywords: Biological networks; Clustering; Gene expression data; Graph mining; Heuristics.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Computer Heuristics*
  • Electronic Data Processing / methods*
  • Gene Expression Regulation*
  • Gene Ontology*
  • Oligonucleotide Array Sequence Analysis*