Applying active learning to high-throughput phenotyping algorithms for electronic health records data

Yukun Chen; Robert J Carroll; Eugenia R McPeek Hinz; Anushi Shah; Anne E Eyler; Joshua C Denny; Hua Xu

doi:10.1136/amiajnl-2013-001945

J Am Med Inform Assoc. 2013 Dec;20(e2):e253-9. doi: 10.1136/amiajnl-2013-001945. Epub 2013 Jul 13.

Authors

Yukun Chen¹, Robert J Carroll, Eugenia R McPeek Hinz, Anushi Shah, Anne E Eyler, Joshua C Denny, Hua Xu

Affiliation

¹ Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, Tennessee, USA.

Abstract

Objectives: Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.

Methods: We integrated an uncertainty sampling AL approach with support vector machine (SVM)-based phenotyping algorithms and evaluated its performance on three annotated disease cohorts: rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. AL performance was compared with that of a passive learning (PL) approach based on random sampling.
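The comparison described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not specify the kernel, batch size, seed set, or stopping rule, so everything here is an assumption made for illustration. A small linear support vector machine trained by Pegasos-style subgradient descent stands in for the classifier, `make_toy_data` generates hypothetical two-dimensional data rather than clinical features, and all function names are invented for this example. Uncertainty sampling is taken to mean querying the unlabeled sample closest to the current decision boundary; passive learning picks a sample at random.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(X, y, epochs=50, lam=0.01, seed=0):
    """Pegasos-style subgradient descent on the hinge loss; labels are -1 or +1."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * (dot(w, X[i]) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrinkage
            if margin < 1.0:  # sample is misclassified or inside the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def select_most_uncertain(w, b, X, pool):
    """Uncertainty sampling: the pool index whose decision value is nearest zero."""
    return min(pool, key=lambda i: abs(dot(w, X[i]) + b))


def run_learning(X, y, n_seed, n_queries, strategy, seed=0):
    """Grow a labeled set one query at a time; strategy is 'active' or 'passive'."""
    rng = random.Random(seed)
    pool = list(range(len(X)))
    rng.shuffle(pool)
    labeled, pool = pool[:n_seed], pool[n_seed:]  # random initial annotations
    for _ in range(n_queries):
        w, b = train_linear_svm([X[i] for i in labeled], [y[i] for i in labeled])
        if strategy == "active":
            pick = select_most_uncertain(w, b, X, pool)
        else:
            pick = rng.choice(pool)  # passive learning: random sampling
        pool.remove(pick)
        labeled.append(pick)  # y[pick] plays the role of the annotator's label
    w, b = train_linear_svm([X[i] for i in labeled], [y[i] for i in labeled])
    return labeled, (w, b)


def make_toy_data(n, seed=0):
    """Hypothetical data: two overlapping Gaussian blobs labeled -1 and +1."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([rng.gauss(label, 1.0), rng.gauss(label, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(200, seed=1)
    for strategy in ("active", "passive"):
        labeled, (w, b) = run_learning(X, y, n_seed=5, n_queries=20, strategy=strategy)
        correct = sum(1 for xi, yi in zip(X, y) if (dot(w, xi) + b >= 0) == (yi == 1))
        print(strategy, "annotated:", len(labeled), "accuracy:", correct / len(X))
```

Retraining after every single query keeps the sketch simple; a practical system would more likely query in batches and evaluate on a held-out test set rather than on the pool itself.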

Results: Our evaluation showed that AL outperformed PL on all three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. For VTE, AL achieved a 68% reduction in the samples needed to reach an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and reduced the number of annotated samples required.
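The reductions reported above compare how many annotated samples each strategy needs before its learning curve reaches a target AUC. The sketch below shows one way such a figure could be computed; it is an assumption about the arithmetic, not the authors' evaluation code. The function names are hypothetical, and the learning curves in the usage example are made-up numbers, not results from the study.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 for cases and anything else for controls.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def samples_to_reach(curve, target):
    """Smallest annotation count whose AUC meets the target, or None if never met.

    curve is a list of (n_annotated, auc) pairs sorted by n_annotated.
    """
    for n_annotated, score in curve:
        if score >= target:
            return n_annotated
    return None


def percent_reduction(n_active, n_passive):
    """Percentage of annotations saved by active learning relative to passive."""
    return 100.0 * (n_passive - n_active) / n_passive


if __name__ == "__main__":
    # Illustrative learning curves only; these values are invented.
    active_curve = [(20, 0.88), (40, 0.95), (60, 0.96)]
    passive_curve = [(20, 0.80), (40, 0.90), (100, 0.95)]
    n_al = samples_to_reach(active_curve, 0.95)
    n_pl = samples_to_reach(passive_curve, 0.95)
    print("AL:", n_al, "PL:", n_pl, "reduction %:", percent_reduction(n_al, n_pl))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but would be replaced by a sort-based computation on larger test sets.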

Conclusions: This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.

Keywords: Active Learning; Electronic Health Records; Machine Learning; Natural Language Processing; Phenotyping Algorithm.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Artificial Intelligence*
Electronic Health Records*
Genetic Association Studies
Humans
Phenotype*
Support Vector Machine
