MODMatcher: multi-omics data matcher for integrative genomic analysis

Seungyeul Yoo; Tao Huang; Joshua D Campbell; Eunjee Lee; Zhidong Tu; Mark W Geraci; Charles A Powell; Eric E Schadt; Avrum Spira; Jun Zhu

doi:10.1371/journal.pcbi.1003790

PLoS Comput Biol. 2014 Aug 14;10(8):e1003790. doi: 10.1371/journal.pcbi.1003790. eCollection 2014 Aug.

Affiliations

¹ Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America.
² Division of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, United States of America.
³ Division of Pulmonary Sciences and Critical Care Medicine, University of Colorado Denver, Aurora, Colorado, United States of America.
⁴ Division of Pulmonary, Critical Care and Sleep Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America.

Abstract

Errors in sample annotation or labeling often occur in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. For integrative genomic studies, it is critical to identify and correct these errors. Different types of genetic and genomic data are interconnected by cis-regulation. On that basis, we developed a computational approach, Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, so that the corrected data can be used in further integrative analysis. Our results indicate that inspecting data for sample annotation and labeling errors is an indispensable quality assurance step. Applied to a large lung genomic study, MODMatcher increased the number of statistically significant genetic associations and genomic correlations more than two-fold. In a simulation study, MODMatcher gave more robust results with three types of omics data than with two. We further demonstrate that MODMatcher can be broadly applied to large genomic data sets containing multiple types of omics data, such as The Cancer Genome Atlas (TCGA) data sets.
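
The matching idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the published MODMatcher algorithm: it assumes two samples-by-features matrices in which column j of one data type is cis-linked to (and positively correlated with) column j of the other, scores every cross-type sample pair, and reports samples whose reciprocal best match disagrees with their label. The function names, the z-score similarity, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def zscore_features(x):
    """Standardize each feature (column) across samples (rows)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def sample_similarity(a, b):
    """Score every sample in `a` against every sample in `b`.

    Both inputs are samples x features. Column j of `a` is assumed to be
    cis-linked to column j of `b` (an assumption of this sketch).
    """
    za, zb = zscore_features(a), zscore_features(b)
    # Mean product of z-scores over the cis-linked feature pairs.
    return za @ zb.T / a.shape[1]


def flag_mismatches(a, b):
    """Return {labeled_index: suggested_index} for samples whose reciprocal
    best match in the other data type is not the sample with the same label."""
    sim = sample_similarity(a, b)
    best_b_for_a = sim.argmax(axis=1)  # best partner in b for each row of a
    best_a_for_b = sim.argmax(axis=0)  # best partner in a for each row of b
    suggestions = {}
    for i, k in enumerate(best_b_for_a):
        if k != i and best_a_for_b[k] == i:
            suggestions[i] = int(k)
    return suggestions


# Simulated data (hypothetical): 30 samples, 200 cis-linked feature pairs.
rng = np.random.default_rng(0)
n_samples, n_pairs = 30, 200
expr = rng.normal(size=(n_samples, n_pairs))
other = expr + 0.5 * rng.normal(size=(n_samples, n_pairs))

# Introduce a labeling error: swap samples 3 and 7 in the second data type.
other[[3, 7]] = other[[7, 3]]

result = flag_mismatches(expr, other)
print(result)
```

On this simulated data the two swapped samples should be the only ones flagged, each pointing at the other. Real data would need a prior step to select the cis-linked feature pairs and a significance threshold on the similarity scores, both of which this sketch omits.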

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

DNA Methylation
Databases, Genetic*
Female
Gene Expression Profiling
Genomics / methods*
Humans
Male
Molecular Sequence Annotation / methods*
Neoplasms / genetics
Polymorphism, Single Nucleotide
Sequence Analysis, DNA
Sequence Analysis, RNA
