Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting

medRxiv [Preprint]. 2024 Oct 29:2024.10.28.24316286. doi: 10.1101/2024.10.28.24316286.

Abstract

Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.

Publication types

  • Preprint