The prediction of organelle-targeting peptides in eukaryotic proteins with Grammatical-Restrained Hidden Conditional Random Fields

Bioinformatics. 2013 Apr 15;29(8):981-8. doi: 10.1093/bioinformatics/btt089. Epub 2013 Feb 21.

Abstract

Motivation: Targeting peptides are the most important signal controlling the import of nuclear-encoded proteins into mitochondria and plastids. In the absence of experimental information, their prediction is an essential step in proteome annotation, needed to infer both the localization and the sequence of mature proteins.

Results: We developed TPpred, a new predictor of organelle-targeting peptides based on Grammatical-Restrained Hidden Conditional Random Fields. TPpred is trained on a non-redundant dataset of 297 proteins in which the presence of a targeting peptide was experimentally validated. When tested on the 297 positive and 8010 negative examples, TPpred outperformed available methods in both accuracy and Matthews correlation coefficient (96% and 0.58, respectively). Given its very low false-positive rate (3.0%), TPpred is well suited for large-scale analyses at the proteome level. We predict that ∼4 to 9% of the sequences in the human, Arabidopsis thaliana and yeast proteomes contain targeting peptides and are, therefore, likely to be localized in mitochondria and plastids. TPpred predictions correlate to a good extent with the experimental annotation of subcellular localization, when available. TPpred was also trained and tested to predict the cleavage site of the organelle-targeting peptide: on this task, the average error of TPpred on mitochondrial and plastidic proteins is 7 and 15 residues, respectively. These values are lower than the errors reported by other currently available methods.
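
For readers unfamiliar with the scores quoted in the Results, the sketch below shows how accuracy, false-positive rate and the Matthews correlation are conventionally computed from a binary confusion matrix, and one plausible reading of the "average error" on cleavage sites (the mean absolute distance, in residues, between predicted and annotated positions). The abstract does not give the formulas, so this reading is an assumption. The function names and all counts and positions are hypothetical illustrations, not code or data from TPpred.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return accuracy, false-positive rate and Matthews correlation
    for a binary confusion matrix (standard textbook definitions)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # False-positive rate: fraction of true negatives predicted positive.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the correlation is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "fpr": fpr, "mcc": mcc}


def mean_cleavage_error(predicted, observed):
    """Mean absolute distance (in residues) between predicted and
    annotated cleavage-site positions. Assumed definition."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical counts and positions, chosen only to exercise the functions.
    print(confusion_metrics(tp=8, tn=85, fp=5, fn=2))
    print(mean_cleavage_error([30, 45], [25, 50]))
```

On a dataset as unbalanced as the one described (far more negatives than positives), accuracy alone can look high even when many positives are missed, which is why a correlation score is usually reported alongside it.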

Availability: The TPpred datasets are available at http://biocomp.unibo.it/valentina/TPpred/. TPpred is available on request from the authors.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Arabidopsis Proteins / chemistry
  • Arabidopsis Proteins / metabolism
  • Chloroplast Proteins / chemistry*
  • Chloroplast Proteins / metabolism
  • Eukaryota
  • Humans
  • Mitochondria / metabolism
  • Mitochondrial Proteins / chemistry*
  • Mitochondrial Proteins / metabolism
  • Peptides / chemistry
  • Plastids / metabolism
  • Position-Specific Scoring Matrices
  • Protein Sorting Signals*
  • Proteome / chemistry
  • Proteome / metabolism
  • Saccharomyces cerevisiae Proteins / chemistry
  • Saccharomyces cerevisiae Proteins / metabolism
  • Sequence Analysis, Protein / methods*
  • Software

Substances

  • Arabidopsis Proteins
  • Chloroplast Proteins
  • Mitochondrial Proteins
  • Peptides
  • Protein Sorting Signals
  • Proteome
  • Saccharomyces cerevisiae Proteins