A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets

Sanghun Lee; Georg Hahn; Julian Hecker; Sharon M Lutz; Kristina Mullin; Alzheimer’s Disease Neuroimaging Initiative (ADNI); Winston Hide; Lars Bertram; Dawn L DeMeo; Rudolph E Tanzi; Christoph Lange; Dmitry Prokopenko

doi:10.1093/bib/bbac611

Brief Bioinform. 2023 Jan 19;24(1):bbac611. doi: 10.1093/bib/bbac611.

Authors

Sanghun Lee¹ ² ³ ⁴, Georg Hahn¹, Julian Hecker² ⁵, Sharon M Lutz¹ ⁵ ⁶, Kristina Mullin⁷; Alzheimer’s Disease Neuroimaging Initiative (ADNI); Winston Hide⁵ ⁸, Lars Bertram⁹ ¹⁰, Dawn L DeMeo² ⁵, Rudolph E Tanzi⁵ ⁷, Christoph Lange¹ ², Dmitry Prokopenko⁵ ⁷

Affiliations

¹ Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA.
² Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA.
³ Department of Medical Consilience, Division of Medicine, Graduate school, Dankook University, South Korea.
⁴ NH Institute for Natural Product Research, Myungji Hospital, South Korea.
⁵ Harvard Medical School, Boston, MA, USA.
⁶ Department of Population Medicine, Harvard Pilgrim Health Care Institute, Boston, MA, USA.
⁷ Genetics and Aging Unit and McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
⁸ Department of Pathology, Beth Israel Deaconess Medical Center, Boston, MA, USA.
⁹ Lübeck Interdisciplinary Platform for Genome Analytics, University of Lübeck, Lübeck, Germany.
¹⁰ Department of Psychology, University of Oslo, Oslo, Norway.

Abstract

Genetic similarity matrices are commonly used to assess population substructure (PS) in genetic studies. Through simulation studies and an application to whole-genome sequencing (WGS) data, we evaluate the performance of three genetic similarity matrices: the unweighted and weighted Jaccard similarity matrices and the genetic relationship matrix. We describe different scenarios that can create numerical pitfalls and, in some instances, lead to incorrect conclusions. We consider scenarios in which PS is assessed based on loci located across the genome ('globally') and based on loci from a specific genomic region ('locally'). We also compare scenarios in which PS is evaluated based on loci from different minor allele frequency bins: common (>5%), low-frequency (0.5-5%) and rare (<0.5%) single-nucleotide variations (SNVs). Overall, we observe that all approaches provide the best clustering performance when computed based on rare SNVs. The performance of the similarity matrices is very similar for common and low-frequency variants, but for rare variants, the unweighted Jaccard matrix provides preferable clustering features. Based on visual inspection and on standard clustering metrics, its clusters in the principal component analysis of rare SNVs are the densest and the best separated, compared with the other methods and the other allele frequency cutoffs. In an application, we assessed the role of rare variants in local and global PS, using WGS data from multiethnic Alzheimer's disease data sets and European or East Asian populations from the 1000 Genomes Project.
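
To make the comparison concrete, the following is a minimal Python sketch of two of the three matrices and of the frequency binning, not the authors' implementation. It assumes a genotype matrix of minor allele counts (0, 1 or 2) with individuals in rows and variants in columns. The function names (maf_bin_masks, jaccard_matrix, genetic_relationship_matrix, top_components) are hypothetical, as are the handling of the bin boundaries, the use of the usual set-based Jaccard index on carrier status, the standardization used for the genetic relationship matrix, and the double-centering step before the eigendecomposition. The weighted Jaccard matrix is omitted because the abstract does not define its weights.

```python
import numpy as np


def maf_bin_masks(genotypes):
    """Column masks for the three minor allele frequency (MAF) bins.

    Assumption: monomorphic variants (MAF == 0) belong to no bin, and the
    boundary values 0.5% and 5% fall into the low-frequency bin.
    """
    freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return {
        "rare": (maf > 0) & (maf < 0.005),
        "low_frequency": (maf >= 0.005) & (maf <= 0.05),
        "common": maf > 0.05,
    }


def jaccard_matrix(genotypes):
    """Unweighted Jaccard similarity on minor allele carrier status.

    Entry (i, j) is the number of variants at which both individuals carry
    a minor allele divided by the number at which at least one of them does.
    Pairs with an empty union are set to 0 (an assumption of this sketch).
    """
    carrier = (genotypes > 0).astype(float)
    shared = carrier @ carrier.T
    counts = carrier.sum(axis=1)
    union = counts[:, None] + counts[None, :] - shared
    return np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)


def genetic_relationship_matrix(genotypes):
    """Genetic relationship matrix from standardized genotypes.

    Each variant is centered by 2p and scaled by sqrt(2p(1-p)), where p is
    its sample allele frequency. Monomorphic variants are dropped because
    their scaling factor is zero.
    """
    p = genotypes.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)
    p = p[keep]
    z = (genotypes[:, keep] - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / z.shape[1]


def top_components(similarity, k=2):
    """Leading k eigenvectors of the double-centered similarity matrix."""
    n = similarity.shape[0]
    centering = np.eye(n) - 1.0 / n
    centered = centering @ similarity @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(centered)
    order = np.argsort(eigenvalues)[::-1][:k]
    return eigenvectors[:, order]
```

Under these assumptions, a rare-variant analysis would restrict the genotype matrix to the columns selected by maf_bin_masks(genotypes)["rare"], pass the result to jaccard_matrix, and plot the columns returned by top_components. The division by 2p(1-p) in the genetic relationship matrix becomes very large as p approaches zero, which is one way rare variants can make that matrix numerically unstable.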

Keywords: Jaccard matrix; genetic relationship matrix; population stratification; principal component analysis; rare variant; similarity matrix.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Computer Simulation
Gene Frequency
Genome*
Genome-Wide Association Study
Genomics*
Polymorphism, Single Nucleotide
Principal Component Analysis
