Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs

Genome Res. 2004 Jun;14(6):1107-18. doi: 10.1101/gr.1774904.

Abstract

Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping--the transfer of interaction annotation from one organism to another using comparative genomics--is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10^-70. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog"--a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.
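The "joint" quantities defined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, parameter names, and example values are hypothetical, and only the geometric-mean definition and the two thresholds (80% identity, E-value of 10^-70) come from the abstract. The sketch assumes that each interacting protein has already been aligned to its counterpart in the target genome, giving one percent identity and one E-value per protein.

```python
import math

# Thresholds quoted in the abstract.
JOINT_IDENTITY_THRESHOLD = 80.0   # percent
JOINT_EVALUE_THRESHOLD = 1e-70

def joint_identity(identity_a, identity_b):
    """Geometric mean of two percent identities (hypothetical helper)."""
    return math.sqrt(identity_a * identity_b)

def joint_evalue(evalue_a, evalue_b):
    """Geometric mean of two E-values (hypothetical helper).

    Computed in log space so that the product of two very small
    E-values does not underflow to zero.
    """
    if evalue_a == 0.0 or evalue_b == 0.0:
        return 0.0
    mean_log = (math.log10(evalue_a) + math.log10(evalue_b)) / 2.0
    return 10.0 ** mean_log

def is_transferable(identity_a, identity_b, evalue_a, evalue_b):
    """Apply the abstract's criterion: joint identity > 80% OR joint E-value < 1e-70."""
    return (
        joint_identity(identity_a, identity_b) > JOINT_IDENTITY_THRESHOLD
        or joint_evalue(evalue_a, evalue_b) < JOINT_EVALUE_THRESHOLD
    )

if __name__ == "__main__":
    # Made-up alignment statistics for one candidate interolog.
    print(joint_identity(90.0, 72.0))
    print(joint_evalue(1e-80, 1e-60))
    print(is_transferable(90.0, 72.0, 1e-80, 1e-60))
```

Because the geometric mean is pulled toward the smaller of two identities, one weakly conserved partner lowers the joint score more than an arithmetic mean would, which suits a criterion that depends on both partners being conserved.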

Publication types

  • Comparative Study
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Amino Acid Sequence / physiology
  • Animals
  • Bacterial Proteins / physiology
  • Binding Sites / physiology
  • Caenorhabditis elegans / genetics
  • Caenorhabditis elegans Proteins / physiology
  • Computational Biology / methods
  • Computational Biology / statistics & numerical data
  • Conserved Sequence / physiology
  • DNA / physiology*
  • DNA, Bacterial / physiology
  • DNA, Fungal / physiology
  • DNA, Helminth / physiology
  • DNA-Binding Proteins / physiology*
  • Databases, Protein
  • Drosophila Proteins / physiology
  • Drosophila melanogaster / genetics
  • Genome*
  • Genome, Bacterial*
  • Genome, Fungal*
  • Helicobacter pylori / genetics
  • Protein Binding / physiology
  • Protein Interaction Mapping / methods
  • Protein Interaction Mapping / statistics & numerical data
  • Proteins / physiology*
  • Saccharomyces cerevisiae / genetics
  • Saccharomyces cerevisiae Proteins / physiology
  • Sequence Homology, Amino Acid

Substances

  • Bacterial Proteins
  • Caenorhabditis elegans Proteins
  • DNA, Bacterial
  • DNA, Fungal
  • DNA, Helminth
  • DNA-Binding Proteins
  • Drosophila Proteins
  • Proteins
  • Saccharomyces cerevisiae Proteins
  • DNA