CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes

Heiner Kuhl; Ling Li; Sven Wuertz; Matthias Stöck; Xu-Fang Liang; Christophe Klopp

doi:10.1093/gigascience/giaa034

Gigascience. 2020 May 1;9(5):giaa034. doi: 10.1093/gigascience/giaa034.

Authors

Heiner Kuhl¹, Ling Li¹,², Sven Wuertz¹, Matthias Stöck¹, Xu-Fang Liang², Christophe Klopp³

Affiliations

¹ Department of Ecophysiology and Aquaculture, Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, 12587 Berlin, Germany.
² College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University; Innovation Base for Chinese Perch Breeding, Key Lab of Freshwater Animal Breeding, Ministry of Agriculture, No.1 Shizishan Street, Hongshan District, 430070 Wuhan, Hubei Province, P.R. China.
³ Sigenae, Bioinfo Genotoul, Mathématiques et Informatique Appliquées de Toulouse, INRAe, 24 Chemin de Borde Rouge, 31320 Auzeville-Tolosane, Castanet Tolosan, France.

Abstract

Background: Easy-to-use and fast bioinformatics pipelines for long-read assembly that go beyond the contig level to generate highly continuous chromosome-scale genomes from raw data remain scarce.

Results: Chromosome-Scale Assembler (CSA) is a novel, computationally highly efficient bioinformatics pipeline that fills this gap. CSA integrates information from scaffolded assemblies (e.g., Hi-C or 10X Genomics) or even from diverged reference genomes into the assembly process. As CSA performs automated assembly of chromosome-sized scaffolds, we benchmark its performance against state-of-the-art reference genomes, i.e., genomes built conventionally in a laborious fashion using multiple separate assembly tools and manual curation. CSA increases contig lengths using scaffolding, local re-assembly, and gap closing. On certain datasets, the initial contig N50 may be increased by up to 4.5-fold. For smaller vertebrate genomes, chromosome-scale assemblies can be achieved within 12 h using low-cost, high-end desktop computers. Mammalian genomes can be processed within 16 h on compute servers. Using diverged reference genomes for fish, birds, and mammals, we demonstrate that CSA calculates chromosome-scale assemblies from long-read data and genome comparisons alone. Even contig-level draft assemblies of diverged genomes are helpful for reconstructing chromosome-scale sequences. CSA is also capable of assembling ultra-long reads.
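
The contig N50 statistic referred to above is the length of the contig at which the cumulative length of all contigs, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that standard definition, not code from CSA itself; the function name `n50` and the toy contig lengths are illustrative assumptions and do not come from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs, so no N50


# Hypothetical toy data (not from the paper): four contigs before
# scaffolding and gap closing, merged into two contigs afterwards.
before = [40, 30, 20, 10]
after = [70, 30]

fold_change = n50(after) / n50(before)
print(n50(before), n50(after), round(fold_change, 2))
```

In this toy case the total assembly length stays the same while the N50 rises, which is the kind of improvement that merging contigs through scaffolding and gap closing is expected to produce.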

Conclusions: CSA can speed up and simplify chromosome-level assembly and significantly lower costs of large-scale family-level vertebrate genome projects.

Keywords: chromosomes; comparative genomics; genome assembly; genome evolution; genome scaffolding; long-read; vertebrates.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Animals
Chromosomes / genetics*
Computational Biology / methods*
Genomics / methods*
High-Throughput Nucleotide Sequencing
Sequence Analysis, DNA
Software*
Synteny
Vertebrates / genetics*