MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification

Bioinformatics. 2024 Oct 1;40(10):btae601. doi: 10.1093/bioinformatics/btae601.

Abstract

Motivation: State-of-the-art tools for classifying metagenomic sequencing reads offer either rapid or accurate options, and combining both properties in a single tool remains an active area of research. The machine learning-based Naïve Bayes Classifier (NBC) approach provides a theoretical basis for accurate classification of all reads in a sample.

Results: We developed the multithreaded Minimizer-based Naïve Bayes Classifier (MNBC) tool to improve the NBC approach by applying minimizers, as well as plurality voting when classification scores are closely related. Using a standard reference- and test-sequence framework with simulated variable-length reads, we benchmarked MNBC against six other state-of-the-art tools: MetaMaps, Ganon, Kraken2, KrakenUniq, CLARK, and Centrifuge. We also applied MNBC to the "marine" and "strain-madness" short-read metagenomic datasets in the Critical Assessment of Metagenome Interpretation (CAMI) II challenge using a corresponding database from the time. MNBC efficiently identified reads from unknown microorganisms, and exhibited the highest species- and genus-level precision and recall on short reads, as well as the highest species-level precision on long reads. It also achieved the highest accuracy on the "strain-madness" dataset.
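
To make the three ideas named above concrete (minimizers, Naïve Bayes scoring, and plurality voting among close scores), the following is a minimal Python sketch. It is not MNBC's implementation. The function names, the lexicographic minimizer ordering, the add-alpha smoothing, the fixed score tolerance, and the toy values of k and w are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from collections import Counter


def minimizers(seq, k, w):
    """Set of minimizers: the lexicographically smallest k-mer in each
    window of w consecutive k-mers (ordering is an assumption)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def log_score(read_mins, genome_mins, vocab_size, alpha=1.0):
    """Naive Bayes log-likelihood of a read given one reference genome.
    Each read minimizer is treated as independent; presence in the genome
    is smoothed with add-alpha (a hypothetical choice of smoothing)."""
    denom = len(genome_mins) + alpha * vocab_size
    return sum(
        math.log(((1.0 if m in genome_mins else 0.0) + alpha) / denom)
        for m in read_mins
    )


def plurality_vote(scores, labels, tolerance):
    """Among genomes scoring within `tolerance` of the best score,
    return the label (e.g. species) held by the most genomes."""
    best = max(scores.values())
    near = [labels[g] for g, s in scores.items() if best - s <= tolerance]
    return Counter(near).most_common(1)[0][0]


def classify(read, genomes, labels, k=3, w=2, tolerance=1.0):
    """Score a read against every reference genome, then vote."""
    index = {g: minimizers(s, k, w) for g, s in genomes.items()}
    vocab_size = len(set().union(*index.values()))
    read_mins = minimizers(read, k, w)
    scores = {g: log_score(read_mins, m, vocab_size) for g, m in index.items()}
    return plurality_vote(scores, labels, tolerance)


# Toy reference set: two genomes of hypothetical species "A", one of "B".
genomes = {
    "g1": "ACGTACGTTTGA",
    "g2": "ACGTACGTTTCA",
    "g3": "GGCCGGCCGGCC",
}
labels = {"g1": "A", "g2": "A", "g3": "B"}
result = classify("ACGTACGT", genomes, labels)
```

In this sketch the vote only matters when several genomes score almost equally; a very small tolerance reduces it to picking the single best-scoring genome. In a real tool the reference index would be built once and reused across reads rather than rebuilt inside `classify`.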

Availability and implementation: MNBC is freely available at: https://github.com/ComputationalPathogens/MNBC.

MeSH terms

  • Algorithms
  • Bayes Theorem*
  • Machine Learning
  • Metagenome
  • Metagenomics* / methods
  • Sequence Analysis, DNA / methods
  • Software*