Purpose: Large language models (LLMs) are neural networks trained on large amounts of textual data, and they show promising performance across various fields. In radiology, studies have demonstrated strong LLM performance on diagnostic imaging quiz cases. However, the prior probability of a given final diagnosis differs inherently between clinical and quiz cases, which poses a challenge for LLMs: in previous studies, the models were not told that the cases were quizzes, whereas human physicians can, consciously or unconsciously, adjust their diagnostic reasoning to the situation. The present study tested the hypothesis that notifying LLMs of the quiz nature of the cases would improve diagnostic accuracy.

Methods: One hundred fifty consecutive cases from the "Case of the Week" radiological diagnostic quiz series on the American Journal of Neuroradiology website were analyzed. GPT-4o and Claude 3.5 Sonnet were prompted to generate the top three differential diagnoses from the textual clinical history and figure legends. For each model, the prompts either included or omitted the information that the case was a quiz. Two radiologists evaluated the accuracy of the diagnoses, and McNemar's test was used to assess differences in correct response rates between the two prompt conditions.

Results: Disclosing the quiz nature improved the diagnostic performance of both models. Specifically, Claude 3.5 Sonnet's primary diagnosis and GPT-4o's top three differential diagnoses improved significantly when the quiz nature was disclosed.

Conclusion: Informing LLMs of the quiz nature of cases significantly enhances their diagnostic performance. This insight into LLMs' capabilities could inform future research and applications, highlighting the importance of context in optimizing LLM-based diagnostics.
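The two prompt conditions described in the Methods can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the study's actual code: the wording of `QUIZ_PREAMBLE`, the `case_text` placeholder, and the exact model identifiers are all assumptions, and it uses the publicly documented OpenAI and Anthropic Python SDKs.

```python
# Minimal sketch of the two prompt conditions (quiz nature disclosed vs. not).
# The prompt wording here is hypothetical; the study's actual prompts are not reproduced.
from openai import OpenAI
import anthropic

QUIZ_PREAMBLE = (
    "The following case is taken from a radiological diagnostic quiz series, "
    "so rare diagnoses are over-represented relative to daily clinical practice.\n\n"
)
TASK = (
    "List the top three differential diagnoses based on the clinical history "
    "and figure legends below:\n\n"
)

def build_prompt(case_text: str, inform_quiz: bool) -> str:
    """Prepend the quiz notice only in the 'informed' condition."""
    return (QUIZ_PREAMBLE if inform_quiz else "") + TASK + case_text

def ask_gpt4o(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed snapshot; the paper's version may differ
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text
```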
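McNemar's test compares paired binary outcomes, here whether the same case was answered correctly under the two prompt conditions, and bases its statistic on the discordant pairs. Below is a minimal sketch using statsmodels; the 2x2 counts are purely hypothetical and do not reflect the study's actual results.

```python
# McNemar's test on paired correct/incorrect outcomes for the same set of cases.
# The counts below are hypothetical (summing to 150 only for illustration).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: quiz nature NOT disclosed (correct, incorrect)
# Cols: quiz nature disclosed     (correct, incorrect)
table = [
    [60, 5],   # correct in both / correct only without disclosure
    [25, 60],  # correct only with disclosure / incorrect in both
]

# The exact binomial version is appropriate when discordant counts are small.
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```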
Keywords: Bayes' theorem; Claude 3.5 Sonnet; GPT-4o; large language model; prompt engineering.
Copyright © 2024, Asari et al.