A feature vector integration approach for a generalized support vector machine pairwise homology algorithm

Comput Biol Chem. 2008 Dec;32(6):458-61. doi: 10.1016/j.compbiolchem.2008.07.017. Epub 2008 Jul 16.

Abstract

Due to the exponential growth of sequenced genomes, the need to quickly provide accurate annotation for existing and new sequences is paramount to facilitate biological research. Current sequence comparison approaches fail to detect homologous relationships when sequence similarity is low. Support vector machine (SVM) algorithms approach this problem by transforming all proteins into a feature space of equal dimension based on protein properties, such as sequence similarity scores against a basis set of proteins or motifs. This multivariate representation of the protein space is then used to build a classifier specific to a pre-defined protein family. However, this approach is not well suited to large-scale annotation. We have developed a SVM approach that formulates remote homology as a single classifier that answers the pairwise comparison problem by integrating the two feature vectors for a pair of sequences into a single vector representation that can be used to build a classifier that separates sequence pairs into homologs and non-homologs. This pairwise SVM approach significantly improves the task of remote homology detection on the benchmark dataset, quantified as the area under the receiver operating characteristic curve; 0.97 versus 0.73 and 0.70 for PSI-BLAST and Basic Local Alignment Search Tool (BLAST), respectively.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Databases, Protein
  • Sequence Homology, Amino Acid*