Repeat- and error-aware comparison of deletions

Bioinformatics. 2015 Sep 15;31(18):2947-54. doi: 10.1093/bioinformatics/btv304. Epub 2015 May 15.

Abstract

Motivation: The number of reported genetic variants is rapidly growing, empowered by ever faster accumulation of next-generation sequencing data. A major issue is comparability. Standards that address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities are missing. This decisively lowers the quality of 'consensus' callsets and hampers the removal of duplicate entries in variant databases, which can have deleterious effects in downstream analyses.

Results: We introduce a sound framework for comparison of deletions that captures both tool-induced inaccuracies and repeat-induced ambiguities. We present a maximum matching algorithm that outputs virtual duplicates among two sets of predictions/annotations. We demonstrate that our approach is clearly superior to ad hoc criteria, like overlap, and that it can substantially reduce the redundancy among callsets. We also identify large numbers of duplicate entries in the Database of Genomic Variants, which underscores the immediate relevance of our approach.
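The matching step described above can be sketched as a maximum bipartite matching between two deletion callsets, where an edge connects two calls deemed equivalent. The sketch below is illustrative only: the paper's actual equivalence criterion is repeat- and error-aware, whereas here a simple hypothetical breakpoint-tolerance predicate (`compatible`, with an assumed `tol` parameter) stands in for it.

```python
def compatible(d1, d2, tol=20):
    """Hypothetical stand-in criterion: two deletions (start, end) are
    considered equivalent if both breakpoints agree within `tol` bp.
    The actual framework uses a repeat- and error-aware criterion."""
    (s1, e1), (s2, e2) = d1, d2
    return abs(s1 - s2) <= tol and abs(e1 - e2) <= tol

def max_matching(set_a, set_b, pred=compatible):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).
    Returns pairs (i, j) of indices into set_a and set_b that were matched;
    matched pairs are the 'virtual duplicates' between the two callsets."""
    match_b = {}  # index into set_b -> index into set_a

    def try_augment(i, seen):
        # Try to match call i from set_a, re-routing earlier matches if needed.
        for j in range(len(set_b)):
            if j in seen or not pred(set_a[i], set_b[j]):
                continue
            seen.add(j)
            if j not in match_b or try_augment(match_b[j], seen):
                match_b[j] = i
                return True
        return False

    for i in range(len(set_a)):
        try_augment(i, set())
    return [(i, j) for j, i in match_b.items()]
```

A maximum matching (rather than greedy pairing) guarantees that no alternative pairing could identify more duplicates, which matters when breakpoint uncertainty makes one call compatible with several candidates.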

Availability and implementation: Implementation is open source and available from https://bitbucket.org/readdi/readdi

Contact: roland.wittler@uni-bielefeld.de or t.marschall@mpi-inf.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Databases, Factual
  • Genetic Variation / genetics*
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Models, Theoretical
  • Repetitive Sequences, Nucleic Acid / genetics*
  • Sequence Analysis, DNA / standards*
  • Sequence Deletion / genetics*
  • Software*