Pathoscope: species identification and strain attribution with unassembled sequencing data

Owen E Francis; Matthew Bendall; Solaiappan Manimaran; Changjin Hong; Nathan L Clement; Eduardo Castro-Nallar; Quinn Snell; G Bruce Schaalje; Mark J Clement; Keith A Crandall; W Evan Johnson

doi:10.1101/gr.150151.112

Genome Res. 2013 Oct;23(10):1721-9. doi: 10.1101/gr.150151.112. Epub 2013 Jul 10.

Authors

Owen E Francis¹, Matthew Bendall, Solaiappan Manimaran, Changjin Hong, Nathan L Clement, Eduardo Castro-Nallar, Quinn Snell, G Bruce Schaalje, Mark J Clement, Keith A Crandall, W Evan Johnson

Affiliation

¹ Department of Statistics, Brigham Young University, Provo, Utah 84602, USA.

Abstract

Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and clinical settings. However, to make the most of these new data, new methodology is needed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality and mapping quality and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly, which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico "environmental" samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.
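
As a rough illustration of how posterior probabilities of matches can be obtained when reads align to several closely related genomes, the sketch below fits a simple mixture model by expectation-maximization. It is not the Pathoscope implementation: the function name `reassign_reads`, the input layout (one dictionary of per-genome alignment likelihoods per read), and the fixed iteration count are all assumptions made for this example, and the abstract does not specify the model at this level of detail.

```python
def reassign_reads(likelihoods, n_iter=50):
    """Estimate genome proportions and per-read posterior assignments.

    likelihoods: list with one dict per read, mapping a genome name to a
    positive alignment likelihood (hypothetical input format).
    Returns (pi, posteriors), where pi maps genome -> estimated proportion
    and posteriors[i] maps genome -> posterior probability for read i.
    """
    genomes = sorted({g for read in likelihoods for g in read})
    # Start from a uniform guess for the genome proportions.
    pi = {g: 1.0 / len(genomes) for g in genomes}
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior of each candidate genome for each read,
        # proportional to (current proportion) x (alignment likelihood).
        posteriors = []
        for read in likelihoods:
            weights = {g: pi[g] * q for g, q in read.items()}
            total = sum(weights.values())
            posteriors.append({g: w / total for g, w in weights.items()})
        # M-step: each proportion becomes the mean posterior mass
        # that the reads assign to that genome.
        for g in genomes:
            pi[g] = sum(p.get(g, 0.0) for p in posteriors) / len(posteriors)
    return pi, posteriors


# Toy data: three reads unique to strain A, one unique to strain B,
# and one read that aligns equally well to both strains.
reads = [{"A": 0.9}, {"A": 0.9}, {"A": 0.9}, {"A": 0.5, "B": 0.5}, {"B": 0.9}]
pi, posteriors = reassign_reads(reads)
print(pi)
print(posteriors[3])
```

In this toy case the ambiguous read is pulled toward strain A because the uniquely mapped reads already favor it, which is the intuition behind discriminating closely related strains from low coverage without assembly.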

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Bacillus anthracis / genetics
Bacteria / classification*
Bacteria / genetics*
Bayes Theorem
Bioterrorism
Burkholderia mallei / genetics
Burkholderia pseudomallei / genetics
Clostridium botulinum / genetics
Computational Biology / methods*
Databases, Genetic*
Escherichia coli / genetics
Escherichia coli Infections / microbiology
Europe
Francisella tularensis / genetics
Genome, Bacterial*
Genomics
High-Throughput Nucleotide Sequencing
Humans
Sequence Analysis, DNA*
Software*
Species Specificity
Yersinia pestis / genetics
