Merging of multi-string BWTs with applications

Bioinformatics. 2014 Dec 15;30(24):3524-31. doi: 10.1093/bioinformatics/btu584. Epub 2014 Aug 28.

Abstract

Motivation: The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.
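To make the k-mer search property concrete, the following is a minimal sketch of a multi-string BWT built by naive suffix sorting, together with a backward-search count. It is a textbook-style illustration, not the construction used in the paper: the function names `multi_string_bwt` and `count_kmer` are hypothetical, the `$` terminator and the tie-break by read index are assumptions, and the quadratic-time sorting is only suitable for toy inputs.

```python
def multi_string_bwt(reads):
    """Naive multi-string BWT: sort every suffix of every read + '$'.

    Assumption: '$' terminates each read and sorts before A, C, G, T;
    equal suffixes are ordered by the index of the read they come from.
    """
    suffixes = []
    for idx, read in enumerate(reads):
        text = read + "$"
        for start in range(len(text)):
            # text[start - 1] is the preceding symbol (wraps to '$' at start 0)
            suffixes.append((text[start:], idx, text[start - 1]))
    suffixes.sort(key=lambda s: (s[0], s[1]))
    return "".join(prev for _, _, prev in suffixes)


def count_kmer(bwt, kmer):
    """Count occurrences of kmer across all reads via backward search."""
    # first[c] = number of symbols in the BWT that sort strictly before c
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(kmer):
        if c not in first:
            return 0
        # Rank queries done by slicing; a real index would precompute them
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


reads = ["ACGT", "CGTA", "ACGA"]
bwt = multi_string_bwt(reads)
print(count_kmer(bwt, "CG"))
```

The count is taken over the whole collection at once, which is what makes a single merged index attractive for comparing datasets.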

Results: We present a novel algorithm that merges multi-string BWTs in [Formula: see text] time, where LCS is the length of the longest common substring between any of the inputs and N is the total length of all inputs combined (number of symbols). The algorithm uses [Formula: see text] bits, where F is the number of multi-string BWTs merged. The merged multi-string BWT is also shown to be more compressible than the input multi-string BWTs taken separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
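The abstract does not spell out the merge procedure, so the sketch below shows only one way two multi-string BWTs can be merged without revisiting the original reads: an interleave (a list recording which input each merged row comes from) is refined pass by pass until it stops changing. The names `rotation_bwt` and `merge_bwts` are hypothetical, the sketch handles two inputs only, and it assumes both inputs were built by sorting cyclic rotations of each read plus a `$` terminator, so that terminators are ordered by read content.

```python
def rotation_bwt(reads):
    """Naive multi-string BWT from sorted cyclic rotations of read + '$'."""
    if not reads:
        return ""
    texts = [r + "$" for r in reads]
    width = 2 * max(len(t) for t in texts)
    rows = []
    for idx, t in enumerate(texts):
        for start in range(len(t)):
            rot = t[start:] + t[:start]
            # Compare rotations as if repeated cyclically; twice the longest
            # text is enough symbols to separate any two distinct rotations.
            key = (rot * (width // len(rot) + 1))[:width]
            rows.append((key, idx, rot[-1]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return "".join(last for _, _, last in rows)


def merge_bwts(bwt0, bwt1):
    """Merge two BWTs by iteratively refining an interleave of 0s and 1s."""
    bwts = (bwt0, bwt1)
    # Start offset of each symbol's bin in the merged first column
    starts, total = {}, 0
    for c in sorted(set(bwt0 + bwt1)):
        starts[c] = total
        total += bwt0.count(c) + bwt1.count(c)
    # Pass 0: rows are not yet distinguished, so input 0 precedes input 1
    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        nxt = dict(starts)
        pos = [0, 0]
        refined = [0] * len(interleave)
        for src in interleave:
            c = bwts[src][pos[src]]   # symbol preceding the current row
            pos[src] += 1
            refined[nxt[c]] = src     # stable placement into c's bin
            nxt[c] += 1
        if refined == interleave:     # fixed point: ordering is final
            break
        interleave = refined
    pos = [0, 0]
    merged = []
    for src in interleave:
        merged.append(bwts[src][pos[src]])
        pos[src] += 1
    return "".join(merged), interleave


merged, interleave = merge_bwts(rotation_bwt(["AC"]), rotation_bwt(["CA"]))
print(merged, interleave)
```

In this sketch each pass reads every symbol exactly once, and a pass sorts rows by one more symbol of context than the previous pass, so the number of passes grows with the length of the longest context the inputs share.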

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Animals
  • Data Compression
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods*
  • Mice
  • Sequence Alignment