Normics: Proteomic Normalization by Variance and Data-Inherent Correlation Structure

Franz F Dressler; Johannes Brägelmann; Markus Reischl; Sven Perner

doi:10.1016/j.mcpro.2022.100269

Mol Cell Proteomics. 2022 Sep;21(9):100269. doi: 10.1016/j.mcpro.2022.100269. Epub 2022 Jul 16.

Authors

Franz F Dressler¹, Johannes Brägelmann², Markus Reischl³, Sven Perner⁴

Affiliations

¹ Institute of Pathology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany; Institute of Pathology, University Medical Center Schleswig-Holstein, Luebeck Site, Luebeck, Germany. Electronic address: franz-friedrich.dressler@charite.de.
² Mildred Scheel School of Oncology, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany; Department of Translational Genomics, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany; Center for Molecular Medicine Cologne, University of Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany.
³ Institute for Automation and Applied Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany.
⁴ Institute of Pathology, University Medical Center Schleswig-Holstein, Luebeck Site, Luebeck, Germany; Institute of Pathology, Research Center Borstel, Leibniz Lung Center, Borstel, Germany.

Abstract

Several algorithms for the normalization of proteomic data are currently available, each based on a priori assumptions. Among these is the extent to which differential expression (DE) can be present in the dataset. This factor is usually unknown in explorative biomarker screens. Simultaneously, the increasing depth of proteomic analyses often requires the selection of subsets with a high probability of being DE to obtain meaningful results in downstream bioinformatic analyses. Based on the relationship of technical variation and (true) biological DE of an unknown share of proteins, we propose the "Normics" algorithm: Proteins are ranked based on their expression level-corrected variance and the mean correlation with all other proteins. The latter serves as a novel indicator of the non-DE likelihood of a protein in a given dataset. Subsequent normalization is based on a subset of non-DE proteins only. No a priori information such as batch, clinical, or replicate group is necessary. Simulation data demonstrated robust and superior performance across a wide range of stochastically chosen parameters. Five publicly available spike-in and biologically variant datasets were normalized reliably and with quantitative accuracy by Normics, with improved performance compared to standard variance stabilization as well as median, quantile, and LOESS normalizations. In complex biological datasets, Normics correctly identified as DE those proteins that had been cross-validated by an independent transcriptome analysis of the same samples. In both complex datasets, Normics identified the most DE proteins. We demonstrate that combining variance analysis and data-inherent correlation structure to identify non-DE proteins improves data normalization. Standard normalization algorithms can be consolidated against high shares of (one-sided) biological regulation. The statistical power of downstream analyses can be increased by focusing on Normics-selected subsets of high DE likelihood.
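
The ranking and subset-based normalization described above can be illustrated with a minimal Python sketch. It is not the published Normics implementation. The abstract does not specify these details, so the following are assumptions made only for illustration: the function name `normics_like_sketch`, the parameter `n_ref`, the level correction (variance divided by mean log intensity), the use of Pearson correlation, the equal-weight sum of the two ranks, the direction of the correlation ranking (higher mean correlation treated as more likely non-DE), and the final median shift on the selected subset. The input is assumed to be a proteins x samples matrix of positive log intensities without missing values.

```python
import numpy as np


def normics_like_sketch(log_x, n_ref=50):
    """Illustrative only: rank proteins, pick a presumed non-DE subset,
    and median-normalize the samples on that subset.

    log_x : 2-D array, proteins x samples, positive log intensities,
            no missing values (assumption).
    n_ref : number of top-ranked proteins used as reference (hypothetical).
    """
    log_x = np.asarray(log_x, dtype=float)
    n_prot = log_x.shape[0]

    # Per-protein variance, corrected for expression level.
    # Assumption: a simple ratio of variance to mean log intensity.
    mean = log_x.mean(axis=1)
    var = log_x.var(axis=1, ddof=1)
    corrected_var = var / mean

    # Mean correlation of each protein with all other proteins
    # (the diagonal self-correlation of 1 is excluded).
    corr = np.corrcoef(log_x)
    mean_corr = (corr.sum(axis=1) - 1.0) / (n_prot - 1)

    # Convert both criteria to ranks (0 = most likely non-DE under the
    # assumptions stated above) and combine them with equal weight.
    var_rank = corrected_var.argsort().argsort()
    corr_rank = (-mean_corr).argsort().argsort()
    score = var_rank + corr_rank

    # Reference subset: the n_ref best-ranked proteins.
    ref = np.argsort(score, kind="stable")[:n_ref]

    # Shift every sample so that the reference-subset medians coincide.
    sample_median = np.median(log_x[ref], axis=0)
    normalized = log_x - sample_median + sample_median.mean()
    return normalized, ref
```

Calling `normics_like_sketch(log_matrix, n_ref=50)` returns the shifted matrix together with the row indices of the reference proteins. Because only the reference subset determines the per-sample shift, proteins outside that subset do not influence the normalization factors, which is the property the abstract attributes to normalizing on non-DE proteins only.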

Keywords: Data normalization; Differential expression analysis; Omics data; Protein quantitation; Proteomics.

MeSH terms

Algorithms
Analysis of Variance
Computer Simulation
Gene Expression Profiling* / methods
Proteins
Proteomics* / methods

Substances

Proteins