Generalizability of lesion detection and segmentation when ScaleNAS is trained on a large multi-organ dataset and validated in the liver

Med Phys. 2024 Nov 22. doi: 10.1002/mp.17504. Online ahead of print.

Abstract

Background: Tumor assessment through imaging is crucial for diagnosing and treating cancer. Lesions in the liver, a common site of metastatic disease, are particularly challenging to detect and segment accurately. This labor-intensive task is subject to individual variation, which drives interest in automation using artificial intelligence (AI).

Purpose: To evaluate AI-based lesion detection and segmentation on CT in the context of human performance on the same tasks. Internal testing determined how a model (ScaleNAS) trained on lesions in multiple organs performs when tested specifically on liver lesions in a dataset integrating real-world and clinical trial data. External testing evaluated whether ScaleNAS's performance generalizes to a publicly available colorectal liver metastases (CRLM) dataset from The Cancer Imaging Archive (TCIA).

Methods: The CUPA study dataset included patients whose CT scans of the chest, abdomen, or pelvis performed at Columbia University between 2010 and 2020 indicated solid tumors (CUIMC, n = 5011), as well as patients from two clinical trials in metastatic colorectal cancer, PRIME (n = 1183) and Amgen (n = 463). Inclusion required ≥1 measurable lesion; exclusion criteria eliminated 1566 patients. Data were divided at the patient level into training (n = 3996), validation (n = 570), and testing (n = 1529) sets. To create the reference standard for training and validation, each case was annotated by one of six radiologists, randomly assigned, who marked the CUPA lesions without access to any previous annotations. For internal testing, we refined the CUPA test set to contain only patients with liver lesions (n = 525) and formed an enhanced reference standard through expert consensus review of the prior annotations. For external testing, TCIA-CRLM (n = 197) formed the test set; its reference standard was formed by consensus review of the original annotations and contours by two new radiologists. Lesion detection was assessed with sensitivity and false positive count. Lesion segmentation was assessed with median Dice coefficient, under-segmentation ratio (USR), and over-segmentation ratio (OSR). Subgroup analysis examined the influence of lesion size ≥10 mm (measurable by RECIST 1.1) versus all lesions (important for early identification of disease progression).
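The segmentation metrics above can be sketched for a pair of binary masks. Note the USR and OSR formulas below use a common convention (missed and spurious voxels normalized by the reference volume); the study does not state its exact definitions, so these formulas are assumptions for illustration.

```python
import numpy as np

def segmentation_metrics(pred, ref):
    """Dice coefficient plus under-/over-segmentation ratios for binary masks.

    USR and OSR here follow a common convention (voxels missed or added,
    normalized by the reference volume); the exact definitions used in the
    study may differ.
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.logical_and(pred, ref).sum()      # correctly segmented voxels
    fn = np.logical_and(~pred, ref).sum()     # reference voxels missed
    fp = np.logical_and(pred, ~ref).sum()     # voxels outside the reference
    dice = 2 * tp / (pred.sum() + ref.sum())
    usr = fn / ref.sum()
    osr = fp / ref.sum()
    return dice, usr, osr

# Toy 1-D example: reference covers 4 voxels, prediction overlaps 3 of them
ref = np.array([0, 1, 1, 1, 1, 0])
pred = np.array([0, 0, 1, 1, 1, 1])
dice, usr, osr = segmentation_metrics(pred, ref)
# dice = 2*3/(4+4) = 0.75, usr = 1/4 = 0.25, osr = 1/4 = 0.25
```

In practice these masks would be 3-D CT volumes; the metric arithmetic is identical since the arrays are flattened by the reductions.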

Results: ScaleNAS trained on all lesions achieved a sensitivity of 71.4% and Dice of 70.2% for liver lesions in the CUPA internal test set (3495 lesions) and a sensitivity of 68.2% and Dice of 64.2% in the TCIA-CRLM external test set (638 lesions). Human radiologists had a mean sensitivity of 53.5% and Dice of 73.9% in CUPA and a sensitivity of 84.1% and Dice of 88.4% in TCIA-CRLM. Performance improved for both ScaleNAS and the radiologists in the subgroup excluding sub-centimeter lesions.

Conclusions: Our study presents the first evaluation of ScaleNAS in medical imaging, demonstrating its liver lesion detection and segmentation performance across diverse datasets. By using consensus reference standards from multiple radiologists, we addressed inter-observer variability and contributed to consistency in lesion annotation. While ScaleNAS does not surpass radiologists in performance, it offers fast and reliable results with potential utility in providing initial contours for radiologists. Future work will extend this model to lung and lymph node lesions, ultimately aiming to enhance clinical applications by generalizing detection and segmentation across tissue types.

Keywords: deep learning; lesion detection; lesion segmentation.
