Scores produced by statistical classifiers in many clinical decision support systems and other medical diagnostic devices are generally on an arbitrary scale, so the clinical meaning of these scores is unclear. Calibrating classifier scores to a meaningful scale, such as the probability of disease, is potentially useful when such scores are presented to a physician. In this work, we investigated three methods (parametric, semi-parametric, and non-parametric) for calibrating classifier scores to the probability of disease scale and developed uncertainty estimation techniques for these methods. We showed that classifier scores on arbitrary scales can be calibrated to the probability of disease scale without affecting their discrimination performance. With a finite dataset available to train the calibration function, it is important to accompany the probability estimate with its confidence interval. Our simulations indicate that, when the dataset used to fit the calibration transformation is also used to estimate calibration performance, resubstitution bias exists for performance metrics that involve the truth states. However, this bias is small for the parametric and semi-parametric methods when the sample size is moderate to large (>100 per class).
Keywords: Calibration; classifier; probability of disease; rationality.
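The following is a minimal illustrative sketch of score calibration, not the paper's own methods: Platt-style logistic calibration stands in for a parametric approach and isotonic regression for a non-parametric one, applied to simulated classifier scores. It also illustrates the claim that a monotone calibration map leaves discrimination performance (AUC) essentially unchanged.

```python
# Illustrative sketch only: logistic (Platt-style) and isotonic calibration
# are stand-ins for the parametric and non-parametric methods discussed in
# the paper; the simulated scores are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated classifier scores on an arbitrary scale:
# diseased cases score higher on average than non-diseased cases.
n_per_class = 200
scores = np.concatenate([rng.normal(0.0, 1.0, n_per_class),   # non-diseased
                         rng.normal(1.5, 1.0, n_per_class)])  # diseased
truth = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Parametric (logistic / Platt-style) calibration to a probability scale.
platt = LogisticRegression().fit(scores.reshape(-1, 1), truth)
p_parametric = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Non-parametric (isotonic regression) calibration.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, truth)
p_nonparametric = iso.predict(scores)

# A monotone calibration map preserves the ranking of cases,
# so discrimination (AUC) is essentially unchanged.
print("AUC, raw scores:          ", roc_auc_score(truth, scores))
print("AUC, parametric calibration:", roc_auc_score(truth, p_parametric))
print("AUC, isotonic calibration:  ", roc_auc_score(truth, p_nonparametric))
```

Note that evaluating calibration on the same data used to fit the calibration map, as in this sketch, is exactly the resubstitution setting discussed above; an independent test set or resampling would be needed for an unbiased assessment.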