CHOP proteins into structural domain-like fragments

Jinfeng Liu; Burkhard Rost

doi:10.1002/prot.20095

Proteins. 2004 May 15;55(3):678-88. doi: 10.1002/prot.20095.

Authors

Jinfeng Liu¹, Burkhard Rost

Affiliation

¹ CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.

PMID: 15103630

Abstract

We developed CHOP, a method that dissects proteins into domain-like fragments. The basic idea was to cut proteins beginning with very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing the dissection through cuts based on the termini of known proteins. In this way, CHOP dissected more than two thirds of all proteins from 62 proteomes. Analysis of our structural domain-like fragments revealed four surprising results. First, >70% of all dissected proteins contained more than one fragment. Second, most domains spanned approximately 100 residues on average. This average was similar for eukaryotic and prokaryotic proteins, and it also held, although not previously described, for all proteins in the PDB. Third, single-domain proteins were significantly longer than most domains in multidomain proteins. Fourth, three fourths of all domains appeared shorter than 210 residues. We believe that our CHOP fragments constitute an important resource for functional and structural genomics. Nevertheless, our main motivation for developing CHOP was that the single-linkage clustering method failed to adequately group full-length proteins. In contrast, CLUP, the simple clustering scheme introduced here, largely succeeded in grouping the CHOP fragments from 62 proteomes such that all members of one cluster shared a basic structural core. CLUP found >63,000 multi- and >118,000 single-member clusters. Although most fragments were restricted to a particular cluster, approximately 24% of the fragments were duplicated in at least two clusters. Our thresholds for grouping two fragments into the same cluster were rather conservative. Nevertheless, our results suggested that structural genomics initiatives have to target >30,000 fragments to at least cover the multimember clusters in 62 proteomes.
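The abstract states only the order in which evidence is used (PDB first, then Pfam-A, then termini of known proteins), not how the cuts are computed. The following Python sketch illustrates one possible reading of that priority order and is not the authors' implementation. The function name `chop`, the rule that a lower-priority region is skipped when it overlaps an accepted one, the `"unassigned"` label, and the `min_len` threshold of 30 residues are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end, source) with 1-based inclusive residue positions
Fragment = Tuple[int, int, str]


def chop(length: int,
         tiers: List[List[Tuple[int, int]]],
         names: List[str],
         min_len: int = 30) -> List[Fragment]:
    """Pick non-overlapping regions, visiting evidence tiers in priority order.

    `tiers[i]` holds candidate regions from the source `names[i]`;
    earlier tiers win over later ones. `min_len` is a hypothetical
    cutoff below which uncovered stretches are not reported.
    """
    accepted: List[Fragment] = []
    for name, regions in zip(names, tiers):
        for start, end in sorted(regions):
            # Accept a region only if it overlaps no accepted fragment.
            if all(end < s or start > e for s, e, _ in accepted):
                accepted.append((start, end, name))
    accepted.sort()

    # Walk along the sequence and report long uncovered stretches too.
    fragments: List[Fragment] = []
    pos = 1
    for s, e, name in accepted:
        if s - pos >= min_len:
            fragments.append((pos, s - 1, "unassigned"))
        fragments.append((s, e, name))
        pos = e + 1
    if length - pos + 1 >= min_len:
        fragments.append((pos, length, "unassigned"))
    return fragments


if __name__ == "__main__":
    # Toy 400-residue protein with made-up region coordinates.
    result = chop(
        400,
        tiers=[[(1, 120)], [(100, 200), (130, 250)], [(260, 400)]],
        names=["PDB", "Pfam-A", "termini"],
    )
    for fragment in result:
        print(fragment)
```

In this toy input, the Pfam-A region 100-200 is skipped because it overlaps the higher-priority PDB region 1-120, while 130-250 is kept. The short uncovered stretches between fragments fall below `min_len` and are not reported.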

Publication types

Evaluation Study
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Amino Acids / chemistry
Databases, Protein
Fungal Proteins / chemistry
Protein Structure, Tertiary*
Proteins / chemistry
Proteins / classification
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid

Substances

Amino Acids
Fungal Proteins
Proteins

Grants and funding

P50 GM62413/GM/NIGMS NIH HHS/United States