A hierarchical approach to removal of unwanted variation for large-scale metabolomics data

Taiyun Kim; Owen Tang; Stephen T Vernon; Katharine A Kott; Yen Chin Koay; John Park; David E James; Stuart M Grieve; Terence P Speed; Pengyi Yang; Gemma A Figtree; John F O'Sullivan; Jean Yee Hwa Yang

doi:10.1038/s41467-021-25210-5

Nat Commun. 2021 Aug 17;12(1):4992. doi: 10.1038/s41467-021-25210-5.

Authors

Taiyun Kim^#^{1, 2, 3}, Owen Tang^#^{1, 4, 5, 6}, Stephen T Vernon^{1, 4, 5, 6}, Katharine A Kott^{1, 4, 5, 6}, Yen Chin Koay^{1, 6, 7}, John Park^{1, 4, 5, 6}, David E James^{1, 8, 9}, Stuart M Grieve^{1, 10, 11}, Terence P Speed^{12, 13}, Pengyi Yang^{1, 2, 3, 6}, Gemma A Figtree^{1, 4, 5, 6}, John F O'Sullivan^{1, 6, 7, 14, 15}, Jean Yee Hwa Yang^{16, 17}

Affiliations

¹ Charles Perkins Centre, The University of Sydney, Sydney, NSW, Australia.
² School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, Australia.
³ Computational Systems Biology Group, Children's Medical Research Institute, Westmead, NSW, Australia.
⁴ Department of Cardiology, Royal North Shore Hospital, Sydney, NSW, Australia.
⁵ Cardiovascular Discovery Group, Kolling Institute of Medical Research, The University of Sydney, Sydney, NSW, Australia.
⁶ Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, Australia.
⁷ Heart Research Institute, Sydney, NSW, Australia.
⁸ School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW, Australia.
⁹ School of Medical Sciences, University of Sydney, Sydney, Australia.
¹⁰ Imaging and Phenotyping Laboratory, Charles Perkins Centre, University of Sydney, Sydney, Australia.
¹¹ Department of Radiology, Royal Prince Alfred Hospital, Camperdown, Australia.
¹² Bioinformatics Division, Walter and Eliza Hall Institute, Parkville, VIC, Australia.
¹³ School of Mathematics and Statistics, University of Melbourne, Parkville, VIC, Australia.
¹⁴ Department of Cardiology, Royal Prince Alfred Hospital, Sydney, NSW, Australia.
¹⁵ Faculty of Medicine, TU Dresden, Germany.
¹⁶ Charles Perkins Centre, The University of Sydney, Sydney, NSW, Australia. jean.yang@sydney.edu.au.
¹⁷ School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, Australia. jean.yang@sydney.edu.au.

^# Contributed equally.

Abstract

Liquid chromatography-mass spectrometry-based metabolomics studies are increasingly applied to large population cohorts, for which data acquisition can run for several weeks or even years. This inevitably introduces unwanted intra- and inter-batch variation over time that can overshadow true biological signals and thus hinder biological discoveries. To date, normalisation approaches have struggled to mitigate the variability introduced by technical factors whilst preserving biological variance, especially for protracted acquisitions. Here, we propose a study design framework that embeds biological sample replicates to quantify variance within and between batches, together with a workflow that uses these replicates to remove unwanted variation in a hierarchical manner (hRUV). We use this design to produce a dataset of more than 1000 human plasma samples run over an extended period of time. We demonstrate significant improvement of hRUV over existing methods in preserving biological signals whilst removing unwanted variation for large-scale metabolomics studies. Our tools not only provide a strategy for large-scale data normalisation but also provide guidance on the design strategy for large omics studies.
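
To illustrate how embedded replicates can quantify variance within and between batches, the following is a minimal sketch and not the hRUV implementation. It assumes repeated measurements of a single biological sample for a single metabolite, each labelled with the batch it was run in. The function name `replicate_variance`, the batch labels, and the intensity values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, variance


def replicate_variance(measurements):
    """Summarise replicate variability for ONE sample and ONE metabolite.

    measurements: iterable of (batch_id, value) pairs, where each value is a
    replicate measurement (for example a log-intensity) of the same sample.
    Returns (within_batch, between_batch) variance estimates.
    """
    by_batch = defaultdict(list)
    for batch_id, value in measurements:
        by_batch[batch_id].append(value)

    # Within-batch: mean of the per-batch sample variances, using only
    # batches that contain at least two replicates.
    within_terms = [variance(v) for v in by_batch.values() if len(v) >= 2]
    within_batch = mean(within_terms) if within_terms else float("nan")

    # Between-batch: sample variance of the batch means, defined only when
    # the replicates span at least two batches.
    batch_means = [mean(v) for v in by_batch.values()]
    between_batch = variance(batch_means) if len(batch_means) >= 2 else float("nan")

    return within_batch, between_batch


# Hypothetical log-intensities: two replicates in each of two batches.
example = [("batch1", 10.0), ("batch1", 10.2), ("batch2", 11.0), ("batch2", 11.4)]
within, between = replicate_variance(example)
print(f"within-batch variance:  {within:.3f}")
print(f"between-batch variance: {between:.3f}")
```

In this toy input the between-batch estimate is much larger than the within-batch estimate, which is the kind of pattern a replicate-based design is meant to expose before any normalisation is applied.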

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Chromatography, Liquid
Humans
Mass Spectrometry / methods
Metabolomics / methods*
Models, Biological
Workflow