Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan

Jpn J Radiol. 2024 Oct 28. doi: 10.1007/s11604-024-01673-6. Online ahead of print.

Abstract

Purpose: This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.

Materials and methods: We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020-2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and by DeepL, and into German and Chinese by GPT-4. GPT-4 answered each question set five times in each language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann-Whitney U test. For a selected subset of questions, scores on English translations produced by a professional service and by GPT-4 were also compared. The impact of translation quality on GPT-4's performance was assessed by linear regression analysis.
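
As an illustration of the between-language comparison described above, the following is a minimal sketch of an exact two-sided Mann-Whitney U test on two sets of five run scores, followed by a Bonferroni adjustment. It is not the authors' analysis code. The score lists, the function names, and the choice of three comparisons are hypothetical and were chosen only to make the example run.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x versus y; a tied pair contributes 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


def exact_two_sided_p(x, y):
    """Exact two-sided p-value from every relabelling of the pooled scores."""
    pooled = list(x) + list(y)
    n1 = len(x)
    center = n1 * len(y) / 2  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - center)
    hits = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        # Count relabellings at least as extreme as the observed split.
        hits += abs(mann_whitney_u(gx, gy) - center) >= observed - 1e-12
    return hits / total


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Hypothetical total scores from five runs per language (not the study data).
japanese_runs = [68, 70, 70, 71, 72]
english_runs = [84, 88, 89, 95, 96]

u_stat = mann_whitney_u(english_runs, japanese_runs)
p_raw = exact_two_sided_p(english_runs, japanese_runs)
# Assumption: three pairwise comparisons against the Japanese originals.
p_adj = bonferroni(p_raw, n_comparisons=3)
print(f"U = {u_stat}, p = {p_raw:.4f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

With five runs per arm, fully separated samples give the smallest two-sided exact p-value this test can produce, 2/252, or about 0.0079.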

Results: The median scores (interquartile range) for the 146 questions were 70 (68-72) (Japanese), 89 (84.5-95.5) (GPT-4 English), 64 (55.5-67) (Chinese), and 56 (46.5-67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).
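
The association between translation quality and the number of correct responses per question can be sketched as an ordinary least-squares fit. The abstract does not state how translation quality was scored, so the 1-to-5 quality rating below is an assumption, and all data values and names are hypothetical.

```python
def ols_fit(x, y):
    """Simple least-squares line y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical per-question data (not the study data):
# an assumed translation quality rating from 1 (poor) to 5 (excellent),
# and the number of correct responses out of five attempts.
quality_rating = [1, 2, 3, 4, 5]
correct_count = [1, 2, 2, 4, 5]

slope, intercept, r_squared = ols_fit(quality_rating, correct_count)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.3f}")
```

A positive slope in a fit of this kind corresponds to the reported pattern: questions with better translations were answered correctly in more of the five attempts.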

Conclusion: GPT-4 is more accurate on English-translated questions than on the original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations. High-quality translation is therefore important for improving GPT-4's response accuracy to diagnostic radiology questions posed in non-English languages, and it can help non-native English speakers obtain accurate answers from large language models.

Keywords: GPT-4; Linguistic variation; Prompt; Radiology board examination; Translation quality.