A self-supervised deep learning method for data-efficient training in genomics

Hüseyin Anil Gündüz; Martin Binder; Xiao-Yin To; René Mreches; Bernd Bischl; Alice C McHardy; Philipp C Münch; Mina Rezaei

doi:10.1038/s42003-023-05310-2

Commun Biol. 2023 Sep 11;6(1):928. doi: 10.1038/s42003-023-05310-2.

Authors

Hüseyin Anil Gündüz^{1,2}, Martin Binder^{1,2}, Xiao-Yin To^{1,2,3,4}, René Mreches^{3,4}, Bernd Bischl^{1,2}, Alice C McHardy^{3,4}, Philipp C Münch^{5,6,7,8}, Mina Rezaei^{9,10}

Affiliations

¹ Department of Statistics, LMU Munich, Munich, Germany.
² Munich Center for Machine Learning, Munich, Germany.
³ Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany.
⁴ Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.
⁵ Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany. philipp.muench@helmholtz-hzi.de.
⁶ Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany. philipp.muench@helmholtz-hzi.de.
⁷ German Center for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany. philipp.muench@helmholtz-hzi.de.
⁸ Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA. philipp.muench@helmholtz-hzi.de.
⁹ Department of Statistics, LMU Munich, Munich, Germany. mina.rezaei@stat.uni-muenchen.de.
¹⁰ Munich Center for Machine Learning, Munich, Germany. mina.rezaei@stat.uni-muenchen.de.

Abstract

Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
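
The abstract names two ingredients: reverse-complement sequences and prediction targets of different lengths. The Python sketch below illustrates both on a plain DNA string. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `reverse_complement` and `context_target_pairs`, the fixed-length context prefix, and the choice of target lengths are all hypothetical, and the abstract does not say how Self-GenomeNet combines these pieces during training.

```python
# Watson-Crick complement table; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement each base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]


def context_target_pairs(seq: str, context_len: int, target_lens: list[int]):
    """Pair a fixed-length context prefix with targets of several lengths.

    Hypothetical helper: the context is the first `context_len` bases and
    each target is the run of bases that immediately follows it. Target
    lengths that would run past the end of the sequence are skipped.
    """
    context = seq[:context_len]
    pairs = []
    for n in target_lens:
        target = seq[context_len:context_len + n]
        if len(target) == n:
            pairs.append((context, target))
    return pairs


if __name__ == "__main__":
    read = "ACGTACGT"
    print(reverse_complement("ATGC"))
    # Short and long targets for the same context; length 6 does not fit.
    for context, target in context_target_pairs(read, 4, [2, 4, 6]):
        print(context, target, reverse_complement(target))
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check when adding it to a data pipeline.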

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology
Deep Learning*
Genomics
Machine Learning