Protein secondary structure prediction is useful for many applications. It can be viewed as a language translation problem: translating a sequence of 20 different amino acids into a sequence of secondary structure symbols (e.g., alpha helix, beta strand, and coil). Here, we develop a novel protein secondary structure predictor called TransPross, based on the transformer network and attention mechanism widely used in natural language processing, that directly extracts evolutionary information from the protein language (i.e., the raw multiple sequence alignment [MSA] of a protein) to predict secondary structure. This differs from traditional methods, which first generate an MSA and then calculate expert-curated statistical profiles from it as input. The attention mechanism used by TransPross can effectively capture long-range residue-residue interactions in protein sequences to predict secondary structure. Benchmarked on several datasets, TransPross outperforms the state-of-the-art methods. Moreover, our experiments show that the prediction accuracy of TransPross correlates positively with the depth of the MSA, and it achieves an average prediction accuracy (Q3 score) above 80% even for hard targets with few homologous sequences in their MSAs. TransPross is freely available at https://github.com/BioinfoMachineLearning/TransPro.
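To make the translation framing concrete, the following is a minimal PyTorch sketch of attention-based per-residue three-class (Q3) prediction from a raw integer-encoded MSA. It is not the TransPross implementation; the class name, hyperparameters, and row-pooling scheme are illustrative assumptions only.

    # Minimal sketch (assumption: NOT the authors' TransPross code) of
    # attention-based per-residue 3-class secondary structure prediction
    # from a raw MSA, using PyTorch.
    import torch
    import torch.nn as nn

    NUM_AA = 21                      # 20 amino acids + gap symbol
    NUM_SS = 3                       # H (helix), E (strand), C (coil): the Q3 classes
    EMB, HEADS, LAYERS = 64, 4, 2    # illustrative hyperparameters, not from the paper

    class MSASecStructSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(NUM_AA, EMB)
            layer = nn.TransformerEncoderLayer(d_model=EMB, nhead=HEADS,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=LAYERS)
            self.head = nn.Linear(EMB, NUM_SS)

        def forward(self, msa: torch.Tensor) -> torch.Tensor:
            # msa: (num_seqs, seq_len) integer-encoded alignment rows
            x = self.embed(msa)       # (num_seqs, seq_len, EMB)
            x = self.encoder(x)       # self-attention spans the full sequence,
                                      # so long-range residue pairs can interact
            x = x.mean(dim=0)         # pool evolutionary signal across MSA rows
            return self.head(x)       # (seq_len, NUM_SS) per-residue logits

    msa = torch.randint(0, NUM_AA, (8, 120))   # toy MSA: 8 homologs, length 120
    logits = MSASecStructSketch()(msa)
    pred = logits.argmax(dim=-1)               # 0=H, 1=E, 2=C for each residue

The sketch illustrates the core idea the abstract describes: the model consumes the raw MSA directly rather than a precomputed statistical profile, and attention, rather than a fixed-width window, carries long-range sequence context into each per-residue prediction.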
Keywords: Attention; Natural language processing; Protein secondary structure prediction; Transformer.