Fast genotyping of known SNPs through approximate k-mer matching

Ariya Shajii; Deniz Yorukoglu; Yun William Yu; Bonnie Berger

doi:10.1093/bioinformatics/btw460

Bioinformatics. 2016 Sep 1;32(17):i538-i544. doi: 10.1093/bioinformatics/btw460.

Authors

Ariya Shajii¹, Deniz Yorukoglu², Yun William Yu³, Bonnie Berger³

Affiliations

¹ Department of Electrical & Computer Engineering, Boston University, Boston, MA 02215, USA.
² Computer Science and AI Lab.
³ Computer Science and AI Lab and Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

Abstract

Motivation: As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterizing known genetic traits and causative disease variants within an individual, and it forms the initial stage of many ancestral and population genomic pipelines (e.g. GWAS).

Results: We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci. LAVA takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically identify loci in the human genome uniquely, without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as little as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays.
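
The general idea of genotyping known SNPs by approximate k-mer matching can be illustrated with a toy sketch. This is not the LAVA implementation: the function names, the input format (a dict of 0-based position to alternate base), the Hamming-distance-1 search, and the fixed allele-fraction thresholds are all assumptions made for illustration. It also omits parts a real tool would need, such as excluding k-mers that occur elsewhere in the genome. The abstract's k = 32 is replaced by a small k so the example stays readable.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(reference, snps, k):
    """Map each k-mer that overlaps a known SNP to (position, allele).

    snps is a hypothetical input format: {0-based position: alternate base}.
    A k-mer that would map to two different targets is stored as None.
    """
    index = {}
    for pos, alt in snps.items():
        for allele, base in (("ref", reference[pos]), ("alt", alt)):
            seq = reference[:pos] + base + reference[pos + 1:]
            window = seq[max(0, pos - k + 1):min(len(seq), pos + k)]
            for kmer in kmers(window, k):
                value = (pos, allele)
                index[kmer] = value if index.get(kmer, value) == value else None
    return index


def neighbors(kmer):
    """Yield all k-mers at Hamming distance exactly 1 from kmer."""
    for i, c in enumerate(kmer):
        for b in "ACGT":
            if b != c:
                yield kmer[:i] + b + kmer[i + 1:]


def lookup(kmer, index):
    """Return (position, allele) for an exact or unique 1-mismatch hit, else None."""
    if kmer in index:
        return index[kmer]
    hits = {index[n] for n in neighbors(kmer) if n in index}
    return hits.pop() if len(hits) == 1 else None


def count_alleles(reads, index, k):
    """Count ref/alt k-mer evidence per SNP position across all reads."""
    counts = defaultdict(lambda: {"ref": 0, "alt": 0})
    for read in reads:
        for kmer in kmers(read, k):
            hit = lookup(kmer, index)
            if hit is not None:
                pos, allele = hit
                counts[pos][allele] += 1
    return counts


def call_genotype(ref_count, alt_count, min_frac=0.2):
    """Call a diploid genotype from counts using an assumed fraction threshold."""
    total = ref_count + alt_count
    if total == 0:
        return "./."
    frac = alt_count / total
    if frac < min_frac:
        return "0/0"
    if frac > 1 - min_frac:
        return "1/1"
    return "0/1"
```

In this sketch, a read k-mer carrying a single sequencing error away from the SNP still resolves to one allele, whereas an error at the SNP base itself is equidistant from both alleles and is discarded as ambiguous.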

Availability and implementation: LAVA software is available at http://lava.csail.mit.edu

Contact: bab@mit.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Alleles
Cluster Analysis
Genome, Human
Genotype*
Humans
Polymorphism, Single Nucleotide*
Software
