Background: Identifying patients at risk of not achieving meaningful gains in long-term postsurgical patient-reported outcome measures (PROMs) is important for improving patient monitoring and facilitating presurgical decision support. Machine learning may help automatically select and weight many predictors to create models that maximize predictive power. However, these techniques are underused in studies of total joint arthroplasty (TJA) patients, particularly those exploring changes in postsurgical PROMs.
Questions/purposes: (1) To evaluate whether machine learning algorithms, applied to hospital registry data, could predict which patients would not achieve a minimal clinically important difference (MCID) in four PROMs 2 years after TJA; (2) to explore how predictive ability changes as more information is included in modeling; and (3) to identify which variables drive the predictive power of these models.
Methods: Data from a single, high-volume institution's TJA registry were used for this study. We identified 7239 hip and 6480 knee TJAs between 2007 and 2012 for which patients had completed both baseline and 2-year followup surveys for at least one PROM (among 19,187 TJAs in our registry and 43,313 total TJAs). In all, 12,203 registry TJAs had valid SF-36 physical component scores (PCS) and mental component scores (MCS) at baseline and 2 years; 7085 and 6205 had valid Hip and Knee Disability and Osteoarthritis Outcome Scores for joint replacement (HOOS JR and KOOS JR scores), respectively. Supervised machine learning refers to a class of algorithms that learn a mapping from inputs to an output based on many input-output examples. We trained three of the most popular such algorithms (logistic least absolute shrinkage and selection operator [LASSO], random forest, and linear support vector machine) to predict 2-year postsurgical MCIDs. We incrementally considered predictors available at four time points: (1) before the decision to have surgery, (2) before surgery, (3) before discharge, and (4) immediately after discharge. We evaluated the performance of each model using the area under the receiver operating characteristic curve (AUROC) on a validation sample composed of a random 20% subsample of TJAs excluded from modeling. We also considered abbreviated models that used only baseline PROMs and procedure as predictors (to isolate their predictive power). We further directly evaluated which variables each model ranked as most predictive of 2-year MCIDs.
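The validation step described in the Methods can be illustrated with a minimal sketch. It is not the study's code: the function names `holdout_split` and `auroc`, the seed, and the toy labels and scores are all hypothetical, and the AUROC is computed with the standard pairwise (Mann-Whitney) formulation rather than whatever software the authors used.

```python
import random


def holdout_split(n_records, holdout_fraction=0.2, seed=0):
    """Randomly split record indices into (training, validation) lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_records * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy illustration with made-up numbers (not registry data):
# label 1 = did not achieve the MCID, score = model-predicted risk.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]

train_idx, valid_idx = holdout_split(10)  # 8 training records, 2 held out
toy_auc = auroc(toy_labels, toy_scores)   # 8 of 9 positive-negative pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in sample size, so for a registry-sized validation sample a rank-based implementation would be preferable.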
Results: The three machine learning algorithms performed in the poor-to-good range for predicting 2-year MCIDs, with AUROCs ranging from 0.60 to 0.89. They performed virtually identically for a given PROM and time point. AUROCs for the logistic LASSO models at the four time points were 0.69, 0.78, 0.78, and 0.78, respectively, for SF-36 PCS 2-year MCIDs; 0.63, 0.89, 0.89, and 0.88 for SF-36 MCS 2-year MCIDs; 0.67, 0.78, 0.77, and 0.77 for HOOS JR 2-year MCIDs; and 0.61, 0.75, 0.75, and 0.75 for KOOS JR 2-year MCIDs. Before-surgery models performed in the fair-to-good range and consistently ranked the associated baseline PROM among the most important predictors. Abbreviated LASSO models performed worse than the full before-surgery models, though they retained much of their predictive power.
Conclusions: Machine learning has the potential to improve clinical decision-making and patient care by helping to prioritize resources for postsurgical monitoring and informing presurgical discussions of likely outcomes of TJA. Applied to presurgical registry data, such models can predict, with fair-to-good ability, 2-year postsurgical MCIDs. Although we report all parameters of our best-performing models, they cannot simply be applied off the shelf without proper testing. Our analyses indicate that machine learning holds much promise for predicting orthopaedic outcomes.
Level of Evidence: Level III, diagnostic study.