WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning

PLoS Comput Biol. 2016 Nov 3;12(11):e1005182. doi: 10.1371/journal.pcbi.1005182. eCollection 2016 Nov.

Abstract

The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy to model human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed using combinations of gene and protein sequence, phylogenetic history, and functional interaction, with progressively increasing accuracy. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms to identify novel least diverged orthologs (LDOs) between 6 eukaryotic species: humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to intelligently incorporate predictions from a wide spectrum of strategies in order to form aggregate predictions of LDOs with high confidence. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between genes in LDO pairs. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. A WORMHOLE web tool has been developed and is available at http://wormhole.jax.org/.
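The meta-tool idea described in the abstract can be illustrated with a minimal sketch. This is not the WORMHOLE implementation: the abstract does not name the learning algorithm, so a plain logistic regression is used here as a stand-in. The sketch also assumes four hypothetical constituent algorithms rather than 17, and the votes and reference labels are synthetic toy data. Each candidate gene pair is represented by the yes/no calls of the constituent algorithms, and the classifier learns how much weight each algorithm's call deserves when predicting whether the pair is a reference LDO.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent.

    X: one row per candidate gene pair, holding 0/1 ortholog calls
       from each constituent algorithm.
    y: 0/1 labels, 1 if the pair is an LDO in the reference set.
    Returns the per-algorithm weights and the bias term.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b

def predict_proba(w, b, x):
    """Aggregate confidence that a candidate pair is an LDO."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Synthetic toy data (assumption): columns are four hypothetical algorithms.
# Algorithm 0 tracks the reference labels closely; algorithm 2 is noisy.
votes = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
is_reference_ldo = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_logistic(votes, is_reference_ldo)

# Score two candidate pairs: one backed by algorithms 0 and 1,
# one called only by the noisy algorithm 2.
supported = predict_proba(weights, bias, [1, 1, 0, 0])
noisy_only = predict_proba(weights, bias, [0, 0, 1, 0])
print(f"supported pair: {supported:.3f}, noisy-only pair: {noisy_only:.3f}")
```

In a sketch like this, a confidence threshold on the output probability would trade the number of predicted LDOs against their reliability. Unlike a simple vote count, the learned weights let an informative algorithm outweigh several uninformative ones.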

MeSH terms

  • Algorithms*
  • Animals
  • Evolution, Molecular*
  • Genetic Speciation
  • Genetic Variation / genetics
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Machine Learning*
  • Pattern Recognition, Automated / methods
  • Proteins / genetics*
  • Sequence Homology, Amino Acid*
  • Software

Substances

  • Proteins