MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Jarkko Toivonen; Pratyush K Das; Jussi Taipale; Esko Ukkonen

doi:10.1093/bioinformatics/btaa045

Bioinformatics. 2020 May 1;36(9):2690-2696. doi: 10.1093/bioinformatics/btaa045.

Authors

Jarkko Toivonen¹, Pratyush K Das², Jussi Taipale³ ⁴ ⁵ ⁶, Esko Ukkonen¹

Affiliations

¹ Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland.
² Applied Tumor Genomics, Research Programs Unit, University of Helsinki, Helsinki FI-00014, Finland.
³ Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK.
⁴ Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, SE 141 83 Stockholm, Sweden.
⁵ Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden.
⁶ Genome-Scale Biology Program, University of Helsinki, Helsinki FI-00014, Finland.

Abstract

Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominant model for transcription factor (TF)-binding motifs in DNA. There is, however, growing evidence that higher-order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs), perform better. Unlike PPMs, ADMs can model dependencies between adjacent nucleotides. A modeling technique and software tool that estimates such models simultaneously for both monomers and their dimers has been missing.
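The difference between the two model classes can be illustrated with a small sketch. It is not taken from the MODER2 software: the function names, the dictionary layout and the toy two-position motif are all assumptions made for illustration. The toy motif strongly prefers the sites AC and GT. A PPM only sees the per-position marginals, so it scores the mixed site AT as highly as AC, whereas the first-order model keeps the dependency between the two positions.

```python
import math

ALPHABET = "ACGT"

def ppm_log_prob(seq, ppm):
    """Log-probability of seq under a PPM; every position is independent."""
    return sum(math.log(ppm[i][base]) for i, base in enumerate(seq))

def adm_log_prob(seq, initial, transitions):
    """Log-probability of seq under a position-specific first-order Markov model.

    initial[b]           = P(x_0 = b)
    transitions[i][a][b] = P(x_{i+1} = b | x_i = a)
    """
    logp = math.log(initial[seq[0]])
    for i in range(len(seq) - 1):
        logp += math.log(transitions[i][seq[i]][seq[i + 1]])
    return logp

def marginal_ppm(initial, transitions):
    """PPM whose columns are the per-position marginals of the Markov model."""
    columns = [dict(initial)]
    for step in transitions:
        prev = columns[-1]
        columns.append(
            {b: sum(prev[a] * step[a][b] for a in ALPHABET) for b in ALPHABET}
        )
    return columns

# Hypothetical toy motif of length 2 (illustrative numbers, not real data).
uniform = {b: 0.25 for b in ALPHABET}
initial = {"A": 0.49, "C": 0.01, "G": 0.49, "T": 0.01}
transitions = [{
    "A": {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},  # A is followed by C
    "C": dict(uniform),
    "G": {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # G is followed by T
    "T": dict(uniform),
}]
ppm = marginal_ppm(initial, transitions)

for site in ("AC", "GT", "AT"):
    print(site,
          "PPM:", round(math.exp(ppm_log_prob(site, ppm)), 4),
          "ADM:", round(math.exp(adm_log_prob(site, initial, transitions)), 4))
```

Deriving the PPM from the marginals of the Markov model keeps the comparison fair: both models agree on single-nucleotide frequencies and differ only in whether adjacent positions are coupled.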

Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average.
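The modular idea of building a dimer from its monomers and then comparing expected with observed models can be sketched as follows. This is a simplified illustration under stated assumptions, not the MODER2 implementation: it uses PPM columns rather than ADMs, handles only non-negative spacing, fills the gap with a uniform background, and the function names (reverse_complement, expected_dimer, deviation) are hypothetical.

```python
ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(ppm):
    """Reverse the column order and swap each base with its complement."""
    return [{COMPLEMENT[b]: p for b, p in col.items()} for col in reversed(ppm)]

def expected_dimer(left, right, gap, left_rc=False, right_rc=False):
    """Expected dimer model: two monomer models joined by `gap` background columns.

    Orientation is expressed by optionally reverse-complementing each half.
    Overlapping (negative) spacing is outside the scope of this sketch.
    """
    if gap < 0:
        raise ValueError("this sketch only supports non-negative spacing")
    background = {b: 0.25 for b in ALPHABET}  # assumed uniform background
    first = reverse_complement(left) if left_rc else left
    second = reverse_complement(right) if right_rc else right
    return ([dict(col) for col in first]
            + [dict(background) for _ in range(gap)]
            + [dict(col) for col in second])

def deviation(observed, expected):
    """Column-wise difference between observed and expected dimer models."""
    return [{b: obs[b] - exp[b] for b in ALPHABET}
            for obs, exp in zip(observed, expected)]

# Hypothetical two-column monomer model (illustrative numbers only).
monomer = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

# Homodimer with the second half reverse-complemented and a 1-bp gap.
expected = expected_dimer(monomer, monomer, gap=1, right_rc=True)
print(len(expected), "columns in the expected dimer model")
```

In this picture, a deviation close to zero in every column would mean that the dimer is explained by two independently binding monomers, while large entries would point to positions where dimerization changes the binding preference.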

Availability and implementation: Software implementation is available from https://github.com/jttoivon/moder2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Binding Sites
Nucleotide Motifs
Position-Specific Scoring Matrices
Protein Binding
Software*
Transcription Factors* / genetics

Substances

Transcription Factors
