Considerations for Optimization of High-Throughput Sequencing Bioinformatics Pipelines for Virus Detection

Christophe Lambert; Cassandra Braxton; Robert L Charlebois; Avisek Deyati; Paul Duncan; Fabio La Neve; Heather D Malicki; Sebastien Ribrioux; Daniel K Rozelle; Brandye Michaels; Wenping Sun; Zhihui Yang; Arifa S Khan

doi:10.3390/v10100528

Viruses. 2018 Sep 27;10(10):528. doi: 10.3390/v10100528.

Affiliations

¹ GSK, 1330 Rixensart, Belgium. christophe.g.lambert@gsk.com.
² Biogen Inc., Research Triangle Park, NC 27709, USA. cassandra.braxton@biogen.com.
³ Analytical Research and Development, Sanofi Pasteur, Toronto, ON M2R 3T4, Canada. Robert.Charlebois@sanofi.com.
⁴ GSK, 1330 Rixensart, Belgium. avis.quest@gmail.com.
⁵ Merck & Co. Inc., West Point, PA 19486, USA. paul_duncan@merck.com.
⁶ Merck KGaA, 10010 Torino, Italy. fabioinusa@gmail.com.
⁷ WuXi AppTec, Philadelphia, PA 19112, USA. Heather.Malicki@wuxiapptec.com.
⁸ Genedata AG, 4053 Basel, Switzerland. sebastien.ribrioux@genedata.com.
⁹ Radiant Systems, Inc., Plainfield, NJ 07080, USA. dan.rozelle@gmail.com.
¹⁰ Analytical Research and Development: Microbiology, Pfizer Inc., Andover, MA 01810, USA. Brandye.Michaels@pfizer.com.
¹¹ WuXi AppTec, Philadelphia, PA 19112, USA. kathryn.sun@gmail.com.
¹² Office of Applied Research and Safety Assessment, Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, Laurel, MD 20708, USA. Zhihui.Yang@fda.hhs.gov.
¹³ Office of Vaccines Research and Review, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA. Arifa.Khan@fda.hhs.gov.

Abstract

High-throughput sequencing (HTS) has demonstrated capabilities for broad virus detection, based on the discovery of known and novel viruses in a variety of samples, including clinical, environmental, and biological. An important goal for HTS applications in biologics is to establish parameter settings that afford adequate sensitivity at an acceptable computational cost (computation time, computer memory, storage, expense, and/or efficiency) at critical steps in the bioinformatics pipeline, including initial data quality assessment, trimming/cleaning, and assembly (to reduce data volume and increase the likelihood of appropriate sequence identification). Additionally, the quality and reliability of the results depend on the availability of a complete and curated viral database; the selection and configuration of sequence alignment programs that retain specificity for broad virus detection with reduced false-positive signals; the removal of host sequences without loss of endogenous viral sequences of interest; and the use of a meaningful reporting format that retains critical information from the analysis, so that the data are readily interpretable and the results actionable. Furthermore, after alignment, both automated and manual evaluation may be needed to verify the results and help assign a potential risk level to residual, unmapped reads. We hope that the collective considerations discussed in this paper will aid the optimization of data analysis pipelines for virus detection by HTS.
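To make the trade-off at the trimming/cleaning step concrete, the following minimal sketch shows how two parameter settings decide which reads survive into later steps. It is an illustration only, not the pipeline described in the paper: the function names `phred_scores` and `trim_3prime`, the parameters `min_q` and `min_len`, their default values, and the Phred+33 quality encoding are all assumptions chosen for the example.

```python
from typing import List, Optional, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(
    seq: str,
    quality: str,
    min_q: int = 20,    # hypothetical threshold, not taken from the paper
    min_len: int = 50,  # hypothetical threshold, not taken from the paper
) -> Optional[Tuple[str, str]]:
    """Remove low-quality bases from the 3' end of a read.

    Returns the trimmed (sequence, quality) pair, or None when the
    trimmed read is shorter than min_len and should be discarded.
    """
    scores = phred_scores(quality)
    end = len(scores)
    # Walk back from the 3' end while base quality is below the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quality[:end]


# Illustrative reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
reads = [
    ("ACGTACGTAC", "IIIIIIII##"),
    ("ACGTACGTAC", "III#######"),
]

# A permissive minimum length keeps more reads (more data to align);
# a strict one discards short reads (less data, possibly lower sensitivity).
permissive = [trim_3prime(s, q, min_q=20, min_len=3) for s, q in reads]
strict = [trim_3prime(s, q, min_q=20, min_len=8) for s, q in reads]
```

With the permissive setting both reads are retained after trimming, whereas the strict setting discards the second read. This is the kind of sensitivity-versus-data-volume balance that parameter choices at this step control.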

Keywords: adventitious virus detection; bioinformatics pipeline; high-throughput sequencing.

MeSH terms

Computational Biology*
DNA, Viral / genetics*
Data Accuracy
Databases as Topic
High-Throughput Nucleotide Sequencing*
RNA, Viral / genetics*
Reproducibility of Results
Research Design
Sequence Alignment
Sequence Analysis
Software
Viruses / genetics
Viruses / isolation & purification*

Substances

DNA, Viral
RNA, Viral