Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets

PLoS Comput Biol. 2018 Mar 28;14(3):e1006080. doi: 10.1371/journal.pcbi.1006080. eCollection 2018 Mar.

Abstract

Somatic copy number variations (CNVs) play a crucial role in the development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types, including exome and targeted sequence data, currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remains challenging due to a lack of ground truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined haplotype-phased allele-specific copy number events into an existing Binary Alignment Map (BAM) file, with a focus on targeted and exome sequencing experiments. As input, this tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for introduction of gains and losses (BED format), and an optional file defining known haplotypes (VCF format). To improve runtime performance, Bamgineer introduces the desired CNVs using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof of principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%, 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for development and systematic benchmarking of CNV calling algorithms by users working with locally generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.
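To make the cellularity arithmetic concrete, the following Python sketch computes the read-depth ratio and B-allele fraction one would expect for a simulated event when tumor cells are mixed with diploid normal cells. It uses the standard linear mixture model and is not taken from the Bamgineer source code. The function names, the diploid-normal assumption, and the evenly spaced cellularity values are illustrative assumptions; the abstract states only the 20-100% range and the count of 5 levels.

```python
from itertools import product

NORMAL_CN = 2  # assumption: admixed normal cells are diploid


def expected_depth_ratio(purity, tumor_cn):
    """Expected tumor/normal read-depth ratio for a region whose
    total copy number in tumor cells is tumor_cn."""
    return (purity * tumor_cn + (1 - purity) * NORMAL_CN) / NORMAL_CN


def expected_baf(purity, tumor_cn, b_copies):
    """Expected B-allele fraction at a heterozygous SNP where tumor
    cells carry b_copies of the B allele out of tumor_cn copies."""
    mixed_b = purity * b_copies + (1 - purity) * 1  # normal cells carry one B copy
    mixed_total = purity * tumor_cn + (1 - purity) * NORMAL_CN
    return mixed_b / mixed_total


# Assumption: five evenly spaced levels spanning the stated 20-100% range.
cellularities = [0.2, 0.4, 0.6, 0.8, 1.0]

# 3 exemplar tumors x 10 tumor types x 5 cellularity levels
n_bam_files = len(list(product(range(3), range(10), cellularities)))

if __name__ == "__main__":
    print("simulated BAM files:", n_bam_files)
    # Single-copy gain of the B allele (total copy number 3, two B copies)
    for p in cellularities:
        ratio = expected_depth_ratio(p, 3)
        baf = expected_baf(p, 3, 2)
        print(f"purity={p:.1f} depth_ratio={ratio:.3f} baf={baf:.3f}")
```

Under this model the signal of a gain shrinks toward the diploid baseline as cellularity falls, which is why benchmarking CNV callers across several cellularity levels is informative.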

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Alleles
  • Computer Simulation
  • DNA Copy Number Variations / genetics
  • Exome / genetics
  • Exome Sequencing / methods
  • Gene Frequency / genetics
  • Genomics
  • Haplotypes / genetics
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Neoplasms / genetics
  • Sequence Alignment / methods*
  • Sequence Analysis, DNA / methods*
  • Software

Grants and funding

This work was supported by funding to TJP from the Princess Margaret Cancer Foundation; Canada Research Chairs Program; Cancer Research Society and Canadian Neuroendocrine Tumour Society; Canada Foundation for Innovation, Leaders Opportunity Fund, CFI #32383; and the Ontario Ministry of Research and Innovation, Ontario Research Fund Small Infrastructure Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.