Practical considerations for specifying a super learner

Rachael V Phillips; Mark J van der Laan; Hana Lee; Susan Gruber

doi:10.1093/ije/dyad023

Int J Epidemiol. 2023 Aug 2;52(4):1276-1285. doi: 10.1093/ije/dyad023.

Authors

Rachael V Phillips¹, Mark J van der Laan¹, Hana Lee², Susan Gruber³

Affiliations

¹ Division of Biostatistics, School of Public Health, University of California at Berkeley, Berkeley, California, United States.
² Office of Biostatistics, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, Maryland, United States.
³ Putnam Data Sciences, LLC, Cambridge, Massachusetts, United States.

PMID: 36905602

Abstract

Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
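To make the idea of considering many learners concrete, the following is a minimal sketch of the simplest variant, often called the discrete super learner: each candidate learner is scored by cross-validated risk and the one with the lowest risk is selected. It is an illustration under stated assumptions, not the authors' implementation. The two toy learners, the squared-error loss, the five-fold scheme, the simulated data and all function names are hypothetical choices made for this example. A full ensemble SL would additionally combine the learners' cross-validated predictions with a meta-learner, which is omitted here.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Toy learner 1: ignore the covariate and predict the training mean."""
    mu = mean(ys)
    return lambda x: mu


def fit_linear(xs, ys):
    """Toy learner 2: least-squares regression of y on a single covariate x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_risk(fit, xs, ys, n_folds=5):
    """Cross-validated mean squared error of one learner (assumed loss)."""
    n = len(xs)
    sq_errors = []
    for k in range(n_folds):
        # Every n_folds-th observation, starting at k, forms the validation fold.
        valid = set(range(k, n, n_folds))
        train_x = [xs[i] for i in range(n) if i not in valid]
        train_y = [ys[i] for i in range(n) if i not in valid]
        predict = fit(train_x, train_y)  # fit on the training folds only
        sq_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in valid)
    return mean(sq_errors)


def discrete_super_learner(library, xs, ys, n_folds=5):
    """Select the learner with the lowest cross-validated risk, then refit it on all data."""
    risks = {name: cv_risk(fit, xs, ys, n_folds) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks, library[best](xs, ys)


# Simulated data for illustration only: y depends linearly on x plus noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + random.gauss(0, 1) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear}
best, risks, predict = discrete_super_learner(library, xs, ys)
print("selected learner:", best)
print("cross-validated risks:", risks)
```

The library is an ordinary dictionary, so adding another candidate (for example, one suggested by a collaborator) only requires a new entry with the same fit-and-return-a-predictor interface. The selection rule itself is fixed before looking at the results.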

Keywords: Super learner; causal inference; disease epidemiology; ensemble machine learning; health outcomes; model validation; prediction; risk assessment; stacking; statistical data analysis.

Publication types

Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms*
Humans
Machine Learning*