Motivation: The number of reported genetic variants is growing rapidly, driven by the ever-faster accumulation of next-generation sequencing data. A major issue is comparability: standards that address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities are still missing. This substantially lowers the quality of 'consensus' callsets and hampers the removal of duplicate entries in variant databases, which can have deleterious effects in downstream analyses.
Results: We introduce a sound framework for the comparison of deletions that captures both tool-induced inaccuracies and repeat-induced ambiguities. We present a maximum matching algorithm that outputs virtual duplicates among two sets of predictions/annotations. We demonstrate that our approach is clearly superior to ad hoc criteria, such as requiring overlap, and that it can substantially reduce the redundancy among callsets. We also identify large numbers of duplicate entries in the Database of Genomic Variants, which underlines the immediate relevance of our approach.
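To illustrate the general idea of matching-based duplicate detection, the following minimal Python sketch pairs deletion calls from two callsets via maximum bipartite matching, so that each call is paired with at most one "virtual duplicate" in the other set. This is not the paper's implementation: the compatible() predicate below is a simplified breakpoint-tolerance placeholder standing in for the repeat-aware comparison criterion, and all names and the tolerance parameter are illustrative assumptions.

    from typing import List, Tuple

    Deletion = Tuple[int, int]  # (start, end) on one chromosome

    def compatible(a: Deletion, b: Deletion, tol: int = 50) -> bool:
        # Placeholder predicate: breakpoints agree within `tol` base pairs.
        # The paper's criterion additionally models repeat-induced ambiguity.
        return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol

    def max_matching(set_a: List[Deletion], set_b: List[Deletion], tol: int = 50):
        # Kuhn's augmenting-path algorithm for maximum bipartite matching.
        adj = [[j for j, b in enumerate(set_b) if compatible(a, b, tol)]
               for a in set_a]
        match_b = [-1] * len(set_b)  # match_b[j] = index in set_a matched to j

        def try_augment(i: int, seen: List[bool]) -> bool:
            for j in adj[i]:
                if not seen[j]:
                    seen[j] = True
                    if match_b[j] == -1 or try_augment(match_b[j], seen):
                        match_b[j] = i
                        return True
            return False

        for i in range(len(set_a)):
            try_augment(i, [False] * len(set_b))
        return [(match_b[j], j) for j in range(len(set_b)) if match_b[j] != -1]

    # Example: two predicted deletions vs. two annotated deletions
    calls = [(1000, 2000), (5000, 6000)]
    annotations = [(1010, 1990), (5030, 6040)]
    print(max_matching(calls, annotations))  # -> [(0, 0), (1, 1)]

Matched pairs can then be reported as putative duplicates; unmatched calls remain as distinct entries.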
Availability and implementation: The implementation is open source and available from https://bitbucket.org/readdi/readdi
Contact: roland.wittler@uni-bielefeld.de or t.marschall@mpi-inf.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.