Objectives: Evaluating occupational exposures in population-based case-control studies often requires exposure assessors to review each study participant's reported occupational information job-by-job to derive exposure estimates. Although such assessments likely have underlying decision rules, they usually lack transparency, are time-consuming and have uncertain reliability and validity. We aimed to identify the underlying rules to enable documentation, review and future use of these expert-based exposure decisions.
Methods: Classification and regression trees (CART, predictions from a single tree) and random forests (predictions from many trees) were used to identify the underlying rules from the questionnaire responses and an expert's exposure assignments for occupational diesel exhaust exposure. Rules were identified for several metrics: binary exposure probability, and ordinal exposure probability, intensity and frequency. Data were split into training (n=10 488 jobs), testing (n=2247) and validation (n=2248) datasets.
Results: The CART and random forest models' predictions agreed with 92-94% of the expert's binary probability assignments. For ordinal probability, intensity and frequency metrics, the two models extracted decision rules more successfully for unexposed and highly exposed jobs (86-90% and 57-85%, respectively) than for low or medium exposed jobs (7-71%).
Conclusions: CART and random forest models extracted decision rules and accurately predicted an expert's exposure decisions for the majority of jobs. They also identified questionnaire response patterns that would require further expert review if the rules were applied to other jobs in the same or a different study. This approach makes the exposure assessment process in case-control studies more transparent and creates a mechanism to efficiently replicate exposure decisions in future studies.
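Illustration: The rule-extraction idea described in the Methods can be sketched with a small single-tree (CART-style) classifier. This is a minimal sketch under stated assumptions, not the study's implementation: the three yes/no questionnaire items, the `expert` function standing in for the expert's binary decision, and the synthetic jobs are all hypothetical. Only the roughly 70/15/15 training/testing/validation proportions mirror the split reported above. A random forest would repeat the same tree-growing step on many bootstrap samples and combine the trees' votes.

```python
import itertools
import random
from collections import Counter

# Hypothetical yes/no questionnaire items (not taken from the study).
FEATURES = ["drove_diesel_vehicle", "worked_near_engines", "office_job"]

def expert(job):
    """Invented stand-in for the expert's binary exposure decision."""
    return int(job["drove_diesel_vehicle"]
               or (job["worked_near_engines"] and not job["office_job"]))

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(jobs, labels, features):
    """Feature whose yes/no split most reduces weighted Gini impurity, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        yes = [y for j, y in zip(jobs, labels) if j[f]]
        no = [y for j, y in zip(jobs, labels) if not j[f]]
        if not yes or not no:
            continue
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score - 1e-12:
            best, best_score = f, score
    return best

def grow(jobs, labels, features, depth=3):
    """A leaf is a class label; a node is (feature, yes_branch, no_branch)."""
    f = best_split(jobs, labels, features) if depth > 0 else None
    if f is None:
        return Counter(labels).most_common(1)[0][0]  # majority class
    rest = [g for g in features if g != f]
    yes = [(j, y) for j, y in zip(jobs, labels) if j[f]]
    no = [(j, y) for j, y in zip(jobs, labels) if not j[f]]
    return (f,
            grow([j for j, _ in yes], [y for _, y in yes], rest, depth - 1),
            grow([j for j, _ in no], [y for _, y in no], rest, depth - 1))

def predict(tree, job):
    """Follow the yes/no branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, yes, no = tree
        tree = yes if job[f] else no
    return tree

def rules(tree, path=()):
    """Flatten the tree into readable (condition, decision) pairs."""
    if not isinstance(tree, tuple):
        return [(" AND ".join(path) or "always", tree)]
    f, yes, no = tree
    return rules(yes, path + (f"{f}=yes",)) + rules(no, path + (f"{f}=no",))

# Synthetic jobs: every response pattern repeated, then shuffled.
random.seed(0)
jobs = [dict(zip(FEATURES, c)) for c in itertools.product([0, 1], repeat=3)] * 125
random.shuffle(jobs)
labels = [expert(j) for j in jobs]

# 700 training, 150 testing (reserved for tuning, unused here), 150 validation.
n_train, n_test = 700, 150
tree = grow(jobs[:n_train], labels[:n_train], FEATURES)
valid_jobs, valid_labels = jobs[n_train + n_test:], labels[n_train + n_test:]
agreement = sum(predict(tree, j) == y
                for j, y in zip(valid_jobs, valid_labels)) / len(valid_jobs)

for condition, decision in rules(tree):
    print(f"IF {condition} THEN exposed={decision}")
print(f"Agreement with expert on validation jobs: {agreement:.0%}")
```

The printed IF/THEN lines are the kind of documented, reviewable rules the approach aims for. Because the synthetic expert here is noise-free, agreement is unrealistically high. With real assignments, leaves that mix exposed and unexposed jobs would mark response patterns needing further expert review.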