Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors

BMJ Open Qual. 2024 Jun 3;13(2):e002654. doi: 10.1136/bmjoq-2023-002654.

Abstract

Background: Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, it requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can classify text effectively when given suitable prompts. ChatGPT may therefore be able to assist manual chart review in detecting diagnostic errors.

Objective: This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations.

Methods: We analysed 545 published case reports that included diagnostic errors. We input the text of each case presentation and final diagnosis, together with several original prompts, into ChatGPT (GPT-4) to generate responses comprising a judgement of whether a diagnostic error had occurred and the factors contributing to it. Contributing factors were coded according to three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The contributing factors identified by ChatGPT were compared with those identified by physicians.
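
For illustration, a pipeline of the kind described above could be scripted as follows. This is a minimal sketch assuming the OpenAI Python SDK; the prompt wording, the helper name review_case and the template fields are hypothetical placeholders, not the authors' actual prompts.

```python
# Minimal sketch of a GPT-4 case-review pipeline (hypothetical prompt, not
# the authors' actual prompt text).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You will be given a case presentation and the final diagnosis.\n"
    "1. Judge whether a diagnostic error occurred (yes/no).\n"
    "2. List the contributing factors, coded with the DEER, RDC and GDP "
    "taxonomies.\n\n"
    "Case presentation:\n{presentation}\n\n"
    "Final diagnosis:\n{diagnosis}\n"
)

def review_case(presentation: str, diagnosis: str) -> str:
    """Send one case report to GPT-4 and return its free-text judgement."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user",
             "content": PROMPT_TEMPLATE.format(
                 presentation=presentation, diagnosis=diagnosis)},
        ],
        temperature=0,  # reduce sampling variability across the 545 cases
    )
    return response.choices[0].message.content
```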

Results: ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more contributing factors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The contributing factors most frequently coded by ChatGPT were 'failure/delay in considering the diagnosis' (315 cases, 57.8%) in DEER, 'atypical presentation' (365 cases, 67.0%) in RDC and 'atypical presentation' (264 cases, 48.4%) in GDP.
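
The per-case count comparison can be illustrated with a paired non-parametric test. The sketch below applies a Wilcoxon signed-rank test to made-up toy counts; the abstract does not state which test the authors used, so both the choice of test and the data are assumptions for illustration only.

```python
# Toy illustration of the paired comparison of per-case factor counts.
# The data and the choice of test are assumptions, not the study's analysis.
from statistics import median
from scipy.stats import wilcoxon

chatgpt_counts = [5, 6, 4, 5, 7, 5, 3]    # toy per-case DEER counts (ChatGPT)
physician_counts = [1, 2, 1, 0, 2, 1, 1]  # toy per-case DEER counts (physicians)

stat, p = wilcoxon(chatgpt_counts, physician_counts)
print(f"median {median(chatgpt_counts)} vs {median(physician_counts)}, "
      f"p={p:.3f}")
```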

Conclusion: ChatGPT accurately detects diagnostic errors from case presentations. It may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.

Keywords: Artificial Intelligence; Chart review methodologies; Diagnostic errors.

MeSH terms

  • Artificial Intelligence / standards
  • Diagnostic Errors* / statistics & numerical data
  • Humans