Objective: Since the introduction of large language models (LLMs), near-expert-level performance has been demonstrated in medical specialties such as radiology. However, there is little to no comparative information on model performance, accuracy, and reliability over time in these specialty domains. This study aims to evaluate and monitor the performance and internal reliability of LLMs in radiology over a three-month period.
Methods: LLMs (GPT-4, GPT-3.5, Claude, and Google Bard) were queried monthly from November 2023 to January 2024 using ACR Diagnostic In-Training Exam (DXIT) practice questions. Overall accuracy and accuracy by subspecialty category were assessed over time. Internal consistency was evaluated through answer mismatch, or intra-model discordance, between trials.
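For illustration, the sketch below shows one way the intra-model discordance metric described above might be computed between two monthly trials of the same model; the function and variable names are hypothetical and are not taken from the study, which does not publish its scoring code.

```python
from typing import Dict

def intra_model_discordance(trial_a: Dict[str, str], trial_b: Dict[str, str]) -> float:
    """Fraction of questions answered differently across two trials of the same model.

    trial_a, trial_b: mapping of question ID -> selected answer choice (e.g. "A"-"D").
    Only questions present in both trials are compared.
    """
    shared = trial_a.keys() & trial_b.keys()
    if not shared:
        raise ValueError("No overlapping questions between trials.")
    mismatches = sum(1 for q in shared if trial_a[q] != trial_b[q])
    return mismatches / len(shared)

# Hypothetical example: two monthly runs of one model on three questions.
november = {"q1": "A", "q2": "C", "q3": "B"}
december = {"q1": "A", "q2": "D", "q3": "B"}
print(f"Discordance: {intra_model_discordance(november, december):.0%}")  # -> 33%
```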
Results: GPT-4 had the highest accuracy (78 ± 4.1%), followed by Google Bard (73 ± 2.9%), Claude (71 ± 1.5%), and GPT-3.5 (63 ± 6.9%). GPT-4 performed significantly better than GPT-3.5 (p = 0.031). Over time, GPT-4's accuracy trended downward (82% to 74%), while Claude's accuracy increased (70% to 73%). Intra-model discordance rates decreased for all models, indicating improved response consistency. Performance varied by subspecialty, with significant differences in the Chest, Physics, Ultrasound, and Pediatrics sections. Models struggled with questions requiring detailed factual knowledge but performed better on broader interpretive questions.
Conclusion: All LLMs except GPT-3.5 performed above 70% accuracy, demonstrating substantial subject-specific knowledge. However, performance fluctuated over time, underscoring the need for continuous, radiology-specific, standardized benchmarking to gauge LLM reliability before clinical use. This study provides a foundational benchmark for future LLM performance evaluations in radiology.
Keywords: AI performance; ChatGPT; Claude; Google Bard; Large language models.