Sequence-based classification using discriminatory motif feature selection

Hao Xiong; Daniel Capurso; Saunak Sen; Mark R Segal

doi:10.1371/journal.pone.0027382

PLoS One. 2011;6(11):e27382. doi: 10.1371/journal.pone.0027382. Epub 2011 Nov 10.

Authors

Hao Xiong¹, Daniel Capurso, Saunak Sen, Mark R Segal

Affiliation

¹ Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, California, United States of America. hao@biostat.ucsf.edu

Abstract

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. 
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
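
The three-partition workflow described in the abstract (discovery, then training, then validation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the split fractions, the synthetic data with a planted `GATTACA` motif, the k-mer ranking used as a toy stand-in for a discriminatory motif finder, and the nearest-centroid classifier are all hypothetical choices made for this example. Any real motif finder and classifier could be swapped in, which is the modularity the abstract refers to.

```python
import random
from collections import Counter


def make_toy_data(n=200, seed=1):
    """Synthetic, unequal-length DNA sequences; class 1 carries a planted motif (assumption)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        seq = "".join(rng.choice("ACGT") for _ in range(rng.randint(40, 80)))
        if label == 1:
            pos = rng.randint(0, len(seq) - 7)
            seq = seq[:pos] + "GATTACA" + seq[pos + 7:]
        data.append((seq, label))
    return data


def three_way_split(items, fractions=(0.3, 0.4, 0.3), seed=0):
    """Shuffle, then cut into discovery, training, and validation partitions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    a = int(fractions[0] * len(shuffled))
    b = a + int(fractions[1] * len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]


def find_motifs(labeled, k=4, top=5):
    """Toy stand-in for a discriminatory motif finder: rank k-mers by the absolute
    difference in the fraction of class-1 vs class-0 sequences that contain them."""
    present = {0: Counter(), 1: Counter()}
    totals = Counter()
    for seq, label in labeled:
        totals[label] += 1
        present[label].update({seq[i:i + k] for i in range(len(seq) - k + 1)})

    def score(m):
        return abs(present[1][m] / max(totals[1], 1) - present[0][m] / max(totals[0], 1))

    kmers = set(present[0]) | set(present[1])
    return sorted(kmers, key=lambda m: (-score(m), m))[:top]


def featurize(seq, motifs):
    """One feature per motif: its non-overlapping occurrence count in the sequence."""
    return [seq.count(m) for m in motifs]


def fit_centroids(labeled, motifs):
    """Nearest-centroid classifier: store the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for seq, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(motifs))
        for j, v in enumerate(featurize(seq, motifs)):
            acc[j] += v
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(seq, motifs, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    vec = featurize(seq, motifs)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])))


data = make_toy_data()
discovery, training, validation = three_way_split(data)
motifs = find_motifs(discovery)              # features come from the discovery partition only
centroids = fit_centroids(training, motifs)  # classifier is fit on the training partition
accuracy = sum(predict(s, motifs, centroids) == y for s, y in validation) / len(validation)
print("selected motifs:", motifs)
print("validation accuracy:", accuracy)
```

Because the motifs are chosen without ever seeing the training or validation sequences, the feature-selection step cannot leak label information into the performance estimate, which is the reason for the additional discovery partition.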

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Amino Acid Motifs*
Humans
Nucleosomes / chemistry*
Nucleosomes / classification*
Proteins / chemistry*
Proteins / classification*
Sequence Alignment
Software

Substances

Nucleosomes
Proteins
