Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

Therese A Catanach; Andrew D Sweet; Nam-Phuong D Nguyen; Rhiannon M Peery; Andrew H Debevec; Andrea K Thomer; Amanda C Owings; Bret M Boyd; Aron D Katz; Felipe N Soto-Adames; Julie M Allen

doi:10.7717/peerj.6142

PeerJ. 2019 Jan 3:7:e6142. doi: 10.7717/peerj.6142. eCollection 2019.

Authors

Therese A Catanach^# ¹,²,³, Andrew D Sweet^# ²,⁴, Nam-Phuong D Nguyen⁵, Rhiannon M Peery⁶,⁷, Andrew H Debevec⁸, Andrea K Thomer⁹, Amanda C Owings¹⁰, Bret M Boyd²,¹¹, Aron D Katz²,¹², Felipe N Soto-Adames¹³,¹⁴, Julie M Allen¹⁵

Affiliations

¹ Ornithology Department, Academy of Natural Sciences of Drexel University, Philadelphia, PA, United States of America.
² Illinois Natural History Survey, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America.
³ Department of Wildlife and Fisheries Sciences, Texas A&M University, College Station, TX, United States of America.
⁴ Department of Entomology, Purdue University, West Lafayette, IN, United States of America.
⁵ Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States of America.
⁶ Department of Biology, University of Alberta, Edmonton, Alberta, Canada.
⁷ Department of Plant Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America.
⁸ School of Integrative Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America.
⁹ School of Information, University of Michigan-Ann Arbor, Ann Arbor, MI, United States of America.
¹⁰ Program in Ecology, Evolution, and Conservation Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America.
¹¹ Department of Entomology, University of Georgia, Athens, GA, United States of America.
¹² Department of Entomology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America.
¹³ Florida State Collection of Arthropods, Florida Department of Agriculture and Consumer Services, Gainesville, FL, United States of America.
¹⁴ Department of Entomology and Nematology, University of Florida, Gainesville, FL, United States of America.
¹⁵ Biology Department, University of Nevada, Reno, Reno, NV, United States of America.

^# Contributed equally.

Abstract

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected "by eye" prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear whether these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach followed by manual editing by eye, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared the tree topologies resulting from these different alignment methods using multiple metrics. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult to align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.

Keywords: Automated alignment; Genome; HBV; Manual alignment; Virus; s-region.

Associated data

Dryad/10.5061/dryad.nc220

Grants and funding

This work was supported by the National Science Foundation (DEB-1239788 and DEB-1342604), which paid for some computational resources and the salaries of Andrew D. Sweet, Bret M. Boyd, and Julie M. Allen. Computational support was provided by the Extreme Science and Engineering Discovery Environment [TG-ASC160042] grant to Nam-Phuong D. Nguyen. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.