The human microbiome contains various microbial macromolecules with important biological functions. Hidden Markov Models (HMMs) can detect low-similarity sequences with distant evolutionary relationships and are widely implemented in sequence alignment software. However, HMM-based sequence alignments can generate a large number of results, and quickly screening and batch-extracting target homologs from microbiomes remains a major sticking point. An integrated gene filtering and extraction pipeline is therefore needed to screen homologs quickly and accurately. Here, we introduce HMMER-Extractor, a supporting toolkit for extracting amino acid or nucleotide sequences that relies on provided filtering scores and an iterative keyword matching (IKM) logic. To make it more accessible, we further present a visual web server platform. Its interactive HTML output offers a convenient way to browse homologous annotations and extract sequences, giving the community a streamlined, user-friendly interface for analyzing microbiomes. Using HMMER-Extractor, we constructed a cardiovascular disease-related gene dataset for the macromolecular metabolites trimethylamine (TMA) and lipopolysaccharide (LPS) based on 46,699 bacterial genomes from the human gut. Approximately 21,014 and 1,961 bacterial strains were identified as containing the cnt or cut operon of TMA and the waa gene cluster of LPS, respectively. Escherichia coli, which belongs to the phylum Proteobacteria, accounted for the largest proportion among all bacterial species. The HMMER-Extractor toolkit is an integrated pipeline that proved accurate and fast in extracting target macromolecule-encoding genes from microbial genomes.
Keywords: Hidden Markov Models (HMMs); Keyword logic; Macromolecular metabolites; Orthologous genes; Threshold.
Copyright © 2024. Published by Elsevier B.V.