Motivation: Analyzing k-mer frequencies in whole-genome sequencing data is becoming a common method for estimating genome size (GS). However, it remains uninvestigated how accurate the method is, especially if it can capture intra-species GS variation.
Results: We present findGSE, which fits skew normal distributions to k-mer frequencies to estimate GS. findGSE outperformed existing tools in an extensive simulation study. Estimating GSs of 89 Arabidopsis thaliana accessions, findGSE showed the highest capability in capturing GS variations. In an application with 71 female and 71 male human individuals, findGSE delivered an average of 3039 Mb as haploid human GS, while female genomes were on average 41 Mb larger than male genomes, in astonishing agreement with size difference of the X and Y chromosomes. Further analysis showed that human GS variations link to geographical patterns and significant differences between populations, which can be explained by variable abundances of LINE-1 retrotransposons.
Availability and implementation: R package of findGSE is freely available at https://github.com/schneebergerlab/findGSE and supported on linux and Mac systems.
Contact: schneeberger@mpipz.mpg.de.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com