Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes

BMC Genomics. 2017 Mar 27;18(1):261. doi: 10.1186/s12864-017-3654-1.

Abstract

Background: Previous studies exploring sequence variation in the model legume, Medicago truncatula, relied on mapping short reads to a single reference. However, read-mapping approaches are inadequate to examine large, diverse gene families or to probe variation in repeat-rich or highly divergent genome regions. De novo sequencing and assembly of M. truncatula genomes enables near-comprehensive discovery of structural variants (SVs), analysis of rapidly evolving gene families, and ultimately, construction of a pan-genome.

Results: Genome-wide synteny based on 15 de novo M. truncatula assemblies effectively detected different types of SVs, indicating that as much as 22% of the genome is involved in large structural changes, altogether affecting 28% of gene models. A total of 63 million base pairs (Mbp) of novel sequence was discovered, expanding the reference genome space for Medicago by 16%. Pan-genome analysis revealed that 42% (180 Mbp) of genomic sequence is missing in one or more accessions, while examination of de novo annotated genes identified 67% (50,700) of all ortholog groups as dispensable, estimates comparable to recent studies in rice, maize and soybean. Rapidly evolving gene families typically associated with biotic interactions and stress response were found to be enriched in the accession-specific gene pool. The nucleotide-binding site leucine-rich repeat (NBS-LRR) family, in particular, harbors the highest levels of nucleotide diversity, large-effect single nucleotide changes, protein diversity, and presence/absence variation. However, the leucine-rich repeat (LRR) and heat shock gene families are disproportionately affected by large-effect single nucleotide changes and even higher levels of copy number variation.
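
The core/dispensable split reported above can be illustrated with a minimal sketch. It assumes ortholog groups are available as a mapping from group ID to the set of accessions in which the group is present. The function names, the toy group IDs and the accession labels are hypothetical, and this is not the pipeline used in the study. Here a group counts as dispensable when it is absent from at least one accession, with accession-specific groups (present in exactly one accession) tracked as a subset of the dispensable pool.

```python
from collections import Counter


def classify_groups(presence, n_accessions):
    """Label each ortholog group by how many accessions contain it.

    presence: dict mapping group ID -> set of accession names (hypothetical input format).
    n_accessions: total number of accessions compared.
    """
    labels = {}
    for group, accessions in presence.items():
        k = len(accessions)
        if k == 0 or k > n_accessions:
            raise ValueError(f"group {group!r} has an invalid accession count: {k}")
        if k == n_accessions:
            labels[group] = "core"                # present in every accession
        elif k == 1:
            labels[group] = "accession-specific"  # present in exactly one accession
        else:
            labels[group] = "dispensable"         # missing from at least one accession
    return labels


def dispensable_fraction(labels):
    """Fraction of groups missing from one or more accessions.

    Accession-specific groups are counted as part of the dispensable pool.
    """
    counts = Counter(labels.values())
    non_core = counts["dispensable"] + counts["accession-specific"]
    return non_core / len(labels)


# Toy data only: three made-up accessions and three made-up ortholog groups.
toy = {
    "OG1": {"acc1", "acc2", "acc3"},
    "OG2": {"acc1", "acc3"},
    "OG3": {"acc2"},
}
labels = classify_groups(toy, n_accessions=3)
fraction = dispensable_fraction(labels)
```

In this toy input, two of the three groups are absent from at least one accession, so the dispensable fraction is 2/3. The same counting applied to real ortholog groups would yield a figure of the kind quoted in the abstract.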

Conclusions: Analysis of multiple M. truncatula genomes illustrates the value of de novo assemblies for discovering and describing structural variation, which is often underestimated by read-mapping approaches. Comparisons among the de novo assemblies also indicate that large gene families differ in the architecture of their structural variation.

MeSH terms

  • Comparative Genomic Hybridization
  • DNA Copy Number Variations / genetics*
  • Genome, Plant*
  • Heat-Shock Proteins / genetics
  • High-Throughput Nucleotide Sequencing
  • Leucine-Rich Repeat Proteins
  • Medicago truncatula / genetics*
  • Plant Proteins / genetics
  • Proteins / genetics
  • RNA, Plant / chemistry
  • RNA, Plant / isolation & purification
  • RNA, Plant / metabolism
  • Sequence Alignment
  • Sequence Analysis, DNA

Substances

  • Heat-Shock Proteins
  • Leucine-Rich Repeat Proteins
  • Plant Proteins
  • Proteins
  • RNA, Plant