Reverse Transcription Errors and RNA-DNA Differences at Short Tandem Repeats

Arkarachai Fungtammasan; Marta Tomaszkiewicz; Rebeca Campos-Sánchez; Kristin A Eckert; Michael DeGiorgio; Kateryna D Makova

doi:10.1093/molbev/msw139

Mol Biol Evol. 2016 Oct;33(10):2744-58. doi: 10.1093/molbev/msw139. Epub 2016 Jul 12.

Authors

Arkarachai Fungtammasan¹, Marta Tomaszkiewicz², Rebeca Campos-Sánchez², Kristin A Eckert³, Michael DeGiorgio⁴, Kateryna D Makova⁵

Affiliations

¹ Integrative Biosciences, Bioinformatics and Genomics Option, Pennsylvania State University; Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University; Huck Institute of Genome Sciences, Pennsylvania State University.
² Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University.
³ Center for Medical Genomics, Pennsylvania State University; Department of Pathology, The Jake Gittlen Laboratories for Cancer Research, The Pennsylvania State University College of Medicine.
⁴ Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University; Institute for CyberScience, Pennsylvania State University. kdm16@psu.edu, mxd60@psu.edu.
⁵ Department of Biology, Pennsylvania State University; Center for Medical Genomics, Pennsylvania State University; Huck Institute of Genome Sciences, Pennsylvania State University. kdm16@psu.edu, mxd60@psu.edu.

Abstract

Transcript variation has important implications for organismal function in health and disease. Most transcriptome studies focus on assessing variation in gene expression levels and isoform representation. Variation at the level of transcript sequence is caused by RNA editing and transcription errors, and leads to nongenetically encoded transcript variants, or RNA-DNA differences (RDDs). Such variation has been understudied, in part because its detection is obscured by reverse transcription (RT) and sequencing errors. So far, it has been evaluated only for intertranscript base substitution differences. Here, we investigated transcript sequence variation for short tandem repeats (STRs). We developed the first maximum-likelihood estimator (MLE) to infer RT error and RDD rates, taking next-generation sequencing error rates into account. Using the MLE, we empirically evaluated RT error and RDD rates for STRs in a large-scale DNA and RNA replicated sequencing experiment conducted in a primate species. The RT error rates increased exponentially with STR length and were biased toward expansions. The RDD rates were approximately one order of magnitude lower than the RT error rates. The RT error rates estimated with the MLE from a primate data set were concordant with those estimated with an independent method, barcoded RNA sequencing, from a Caenorhabditis elegans data set. Our results have important implications for medical genomics, as STR allelic variation is associated with >40 diseases. STR nonallelic transcript variation can also contribute to disease phenotype. The MLE and empirical rates presented here can be used to evaluate the probability of disease-associated transcripts arising due to RDD.
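
The idea of separating RT errors from sequencing errors can be illustrated with a deliberately simplified sketch. This is not the estimator developed in the paper, whose model is not given in this record. It assumes a toy model in which each RNA-derived read at an STR locus is erroneous if either an RT error or a sequencing error occurs, the two being independent, and in which the sequencing error rate is already known (for example, from DNA reads). The function name, its parameters, and the numbers in the usage lines are all hypothetical.

```python
def estimate_rt_error_rate(rna_error_reads, rna_total_reads, seq_error_rate):
    """Toy maximum-likelihood estimate of an RT error rate.

    Assumed model (an illustration, not the paper's):
        P(read is erroneous) = 1 - (1 - rt_rate) * (1 - seq_error_rate)
    with the erroneous-read count binomially distributed.
    """
    if rna_total_reads <= 0:
        raise ValueError("rna_total_reads must be positive")
    if not 0 <= rna_error_reads <= rna_total_reads:
        raise ValueError("rna_error_reads must lie in [0, rna_total_reads]")
    if not 0.0 <= seq_error_rate < 1.0:
        raise ValueError("seq_error_rate must lie in [0, 1)")

    # Binomial MLE of the combined (RT + sequencing) error probability.
    observed = rna_error_reads / rna_total_reads

    # Solve the assumed model for the RT component.
    rt_rate = 1.0 - (1.0 - observed) / (1.0 - seq_error_rate)

    # The rate cannot be negative; when the observed rate falls below the
    # sequencing error rate, the constrained maximum is at zero.
    return min(1.0, max(0.0, rt_rate))


# Hypothetical counts: 30 erroneous reads out of 1000, 1% sequencing error.
print(estimate_rt_error_rate(30, 1000, 0.01))
# Fewer errors than sequencing alone would explain: estimate is clipped to 0.
print(estimate_rt_error_rate(5, 1000, 0.01))
```

Under this toy model the estimate follows directly from the observed error fraction. A realistic estimator would also need to model STR length, the direction of the error (expansion versus contraction), and replicate structure, none of which is attempted here.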

Keywords: RNA sequencing; RNA–DNA differences; error correction model; microsatellites; reverse transcription errors; sequencing errors; tandem repeats; transcription errors.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, Non-U.S. Gov't

MeSH terms

Alleles
DNA / genetics*
DNA Repair
Genetic Variation
Genomics / methods
High-Throughput Nucleotide Sequencing / methods
Humans
Microsatellite Repeats*
RNA / genetics*
Reverse Transcription*
Sequence Analysis, RNA
Transcriptome

Substances

RNA
DNA

Grants and funding

R01 GM087472/GM/NIGMS NIH HHS/United States