Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation

F1000Res. 2023 Sep 25:12:1205. doi: 10.12688/f1000research.139488.1. eCollection 2023.

Abstract

Background: Advancements in DNA sequencing technology have transformed the field of bacterial genomics, allowing for faster and more cost-effective chromosome-level assemblies than were possible a decade ago. However, transforming raw reads into a complete genome model remains a significant computational challenge because the quality and quantity of data vary between sequencing instruments, and because of intrinsic characteristics of the genome and the analyses desired. To address this issue, we have developed a set of container-based pipelines using Nextflow, offering both common workflows for inexperienced users and high levels of customization for experienced ones. Their processing strategies adapt to the sequencing data type, and their modularity enables the incorporation of new components to address the community's evolving needs.

Methods: These pipelines consist of three parts: quality control, de novo genome assembly, and bacterial genome annotation. In particular, the genome annotation pipeline provides a comprehensive overview of the genome, including standard gene prediction and functional inference, as well as predictions relevant to clinical applications such as virulence and resistance gene annotation, secondary metabolite detection, prophage and plasmid prediction, and more.

Results: The annotation results are presented in reports, genome browsers, and a web-based application that enables users to explore and interact with the genome annotation results.

Conclusions: Overall, our user-friendly pipelines offer a seamless integration of computational tools to facilitate routine bacterial genomics research. Their effectiveness is illustrated by examining the sequencing data of a clinical sample of Klebsiella pneumoniae.

Keywords: antibiotic resistance; bacterial genomics; Nextflow; pipelines; public health; virulence.

Publication types

  • Review

MeSH terms

  • Base Sequence
  • Genome, Bacterial*
  • Molecular Sequence Annotation
  • Sequence Analysis, DNA / methods
  • Software*

Grants and funding

This work was funded in part by a scholarship from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) to FMA and by grant number 806/2019 from Fundação de Amparo à Pesquisa do Distrito Federal (FAP-DF) to GPJ.