Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage

Sci Rep. 2020 Feb 6;10(1):2057. doi: 10.1038/s41598-020-59026-y.

Abstract

The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole-exome sequencing (WES) and whole-genome sequencing (WGS), are often debated. WES has dominated large-scale resequencing projects because of its lower cost and easier data storage and processing. The rapid development of third-generation sequencing methods and novel exome sequencing kits necessitates a robust statistical framework that allows informative and easy performance comparison of the emerging methods. In this study, we developed a set of statistical tools to systematically assess the coverage of coding regions provided by several modern WES platforms, as well as by PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for the mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contribution of the major determinants of coding sequence (CDS) coverage. Contrary to a common view, most of the observed bias in modern WES stems from the mappability limitations of short reads and from exome probe design rather than from sequence composition. We also identified a ~500 kb region of the human exome that cannot be effectively characterized using short-read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for the improvement of enrichment-based methods and the development of novel approaches that would maximize variant discovery at optimal cost.
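
To make the idea of a coverage evenness metric concrete, the following is a minimal Python sketch. It is not the study's method: the abstract does not define its novel metrics, so this uses two generic stand-ins (the fraction of bases covered at no less than 20% of the mean depth, and the coefficient of variation). The function name `coverage_evenness`, the `fraction_of_mean` parameter, and the optional `mappable` mask are hypothetical names introduced only for illustration. The mask shows how positions that short reads cannot map uniquely could be excluded before a platform is scored.

```python
from statistics import mean, pstdev


def coverage_evenness(depths, mappable=None, fraction_of_mean=0.2):
    """Summarize how evenly a set of CDS positions is covered.

    depths           -- per-base read depths (one number per position)
    mappable         -- optional booleans, same length as depths; positions
                        marked False are dropped before scoring (hypothetical
                        mask, e.g. derived from a mappability track)
    fraction_of_mean -- a base counts as adequately covered if its depth is
                        at least this fraction of the mean depth
    """
    if mappable is not None:
        if len(mappable) != len(depths):
            raise ValueError("mappable and depths must have the same length")
        depths = [d for d, ok in zip(depths, mappable) if ok]
    if not depths:
        raise ValueError("no positions left to evaluate")

    avg = mean(depths)
    if avg == 0:
        # Nothing is covered: evenness is undefined, report zero coverage.
        return {"mean_depth": 0.0, "frac_covered": 0.0, "cv": float("nan")}

    threshold = fraction_of_mean * avg
    frac_covered = sum(d >= threshold for d in depths) / len(depths)
    cv = pstdev(depths) / avg  # lower means more even coverage
    return {"mean_depth": avg, "frac_covered": frac_covered, "cv": cv}


# Toy example: the two zero-depth positions are flagged as unmappable.
depths = [0, 0, 20, 20]
mask = [False, False, True, True]
all_positions = coverage_evenness(depths)
mappable_only = coverage_evenness(depths, mappable=mask)
```

In this toy input, scoring all positions makes the coverage look uneven, while restricting the calculation to mappable positions shows that the remaining bases are covered uniformly. This mirrors the abstract's point that comparisons ignoring mappability can attribute to a platform a bias that belongs to short reads in general.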

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Sequence / genetics
  • Data Interpretation, Statistical
  • Exome / genetics*
  • Exome Sequencing / statistics & numerical data*
  • Genome, Human / genetics*
  • High-Throughput Nucleotide Sequencing / statistics & numerical data*
  • Humans
  • Machine Learning
  • Models, Genetic
  • Open Reading Frames / genetics
  • Regression Analysis
  • Whole Genome Sequencing / statistics & numerical data*