Systematic auditing is essential to debiasing machine learning in biology

Fatma-Elzahraa Eid; Haitham A Elmarakeby; Yujia Alina Chan; Nadine Fornelos; Mahmoud ElHefnawi; Eliezer M Van Allen; Lenwood S Heath; Kasper Lage

doi:10.1038/s42003-021-01674-5

Commun Biol. 2021 Feb 10;4(1):183. doi: 10.1038/s42003-021-01674-5.

Authors

Fatma-Elzahraa Eid^#^{1,2}, Haitham A Elmarakeby^#^{3,4,5}, Yujia Alina Chan^#³, Nadine Fornelos³, Mahmoud ElHefnawi⁶, Eliezer M Van Allen^{3,5}, Lenwood S Heath⁷, Kasper Lage^{8,9,10}

Affiliations

¹ Broad Institute of MIT and Harvard, Cambridge, MA, USA. fatma@broadinstitute.org.
² Department of Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt. fatma@broadinstitute.org.
³ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Department of Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt.
⁵ Dana-Farber Cancer Institute, Boston, MA, USA.
⁶ Informatics and Systems Department, Division of Engineering Research, National Research Centre, Giza, Egypt.
⁷ Virginia Polytechnic Institute and State University, Blacksburg, VA, USA.
⁸ Broad Institute of MIT and Harvard, Cambridge, MA, USA. lage.kasper@mgh.harvard.edu.
⁹ Department of Surgery, Massachusetts General Hospital, Boston, MA, USA. lage.kasper@mgh.harvard.edu.
¹⁰ Harvard Medical School, Boston, MA, USA. lage.kasper@mgh.harvard.edu.

^# Contributed equally.

Abstract

Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications.

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Bias
Data Mining*
Databases, Protein
Histocompatibility Antigens / metabolism
Humans
Machine Learning*
Pharmaceutical Preparations / chemistry
Pharmaceutical Preparations / metabolism
Protein Binding
Protein Interaction Maps
Proteins / chemistry
Proteins / metabolism*
Proteome*
Proteomics*
Reproducibility of Results

Substances

Histocompatibility Antigens
Pharmaceutical Preparations
Proteins
Proteome

Grants and funding

R01 MH109903/MH/NIMH NIH HHS/United States