Accuracy and Readability of ChatGPT on Potential Complications of Interventional Radiology Procedures: AI-Powered Patient Interviewing

Acad Radiol. 2024 Nov 16:S1076-6332(24)00791-8. doi: 10.1016/j.acra.2024.10.028. Online ahead of print.

Abstract

Rationale and objectives: It is crucial to inform the patient about potential complications and obtain consent before interventional radiology procedures. In this study, we investigated the accuracy, reliability, and readability of the information provided by ChatGPT-4 about potential complications of interventional radiology procedures.

Materials and methods: ChatGPT-4 was asked about the potential major and minor complications of 25 different interventional radiology procedures (8 non-vascular, 17 vascular). The responses were evaluated by two experienced interventional radiologists (>25 years and 10 years of experience) using a 5-point Likert scale according to Cardiovascular and Interventional Radiological Society of Europe guidelines. The difference between the two interventional radiologists' scores was tested with the Wilcoxon signed-rank test, and their agreement was assessed with the Intraclass Correlation Coefficient (ICC) and Pearson correlation coefficient (PCC). In addition, readability and complexity were quantitatively assessed using the Flesch-Kincaid Grade Level, Flesch Reading Ease score, and Simple Measure of Gobbledygook (SMOG) index.
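
For readers unfamiliar with the three readability measures, the following is a minimal sketch of their standard published formulas. It is not the tooling used in the study, which the abstract does not specify. The syllable counter (`count_syllables`) and the sentence splitter are crude heuristics assumed here for illustration only, and all function names are hypothetical.

```python
import math
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables, sentences):
    """SMOG index from the count of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def count_syllables(word):
    """ASSUMPTION: approximate syllables as runs of vowels (a heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Compute all three scores for a text using the heuristics above."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text contains no words")
    # ASSUMPTION: each run of ., ! or ? ends one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllable_counts = [count_syllables(w) for w in words]
    syllables = sum(syllable_counts)
    polysyllables = sum(1 for n in syllable_counts if n >= 3)
    return {
        "flesch_reading_ease": flesch_reading_ease(len(words), sentences, syllables),
        "flesch_kincaid_grade": flesch_kincaid_grade(len(words), sentences, syllables),
        "smog_index": smog_index(polysyllables, sentences),
    }
```

Because the vowel-run heuristic miscounts many English words, scores from this sketch will differ somewhat from those of dedicated readability software.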

Results: Interventional radiologist 1 (IR1) and interventional radiologist 2 (IR2) gave 104 and 109 points, respectively, out of a possible 125 points across all procedures. There was no statistically significant difference between the total scores of the two IRs (p = 0.244). The IRs demonstrated high agreement across all procedure ratings (ICC = 0.928). Both IRs scored 34 out of 40 points for the eight non-vascular procedures. The 17 vascular procedures received 70 points out of 85 from IR1 and 75 from IR2. The agreement between the two observers' assessments was good, with PCC values of 0.908 and 0.896 for non-vascular and vascular procedures, respectively. Overall readability was low. The mean Flesch-Kincaid Grade Level, Flesch Reading Ease score, and SMOG index were 12.51 ± 1.14 (college level), 30.27 ± 8.38 (college level), and 14.46 ± 0.76 (college level), respectively. There was no statistically significant difference in readability between non-vascular and vascular procedures (p = 0.16).
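
The score totals above follow directly from the subgroup scores and the 5-point maximum per procedure. The short sketch below reproduces that arithmetic from the reported figures only. The percentages it derives are not stated in the abstract, and the variable and function names are hypothetical.

```python
MAX_POINTS_PER_PROCEDURE = 5  # 5-point Likert scale

# Number of procedures per group, as reported.
procedure_counts = {"non_vascular": 8, "vascular": 17}

# Subgroup points awarded by each interventional radiologist, as reported.
awarded = {
    "IR1": {"non_vascular": 34, "vascular": 70},
    "IR2": {"non_vascular": 34, "vascular": 75},
}


def summarize(rater):
    """Return (total points, maximum points, percent of maximum) for a rater."""
    total = sum(awarded[rater].values())
    maximum = MAX_POINTS_PER_PROCEDURE * sum(procedure_counts.values())
    return total, maximum, round(100 * total / maximum, 1)


for rater in awarded:
    total, maximum, percent = summarize(rater)
    print(f"{rater}: {total}/{maximum} ({percent}%)")
```

Under these figures, IR1's total corresponds to 83.2% and IR2's to 87.2% of the maximum achievable score.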

Conclusion: ChatGPT-4 demonstrated remarkable performance, highlighting its potential to enhance accessibility to information about interventional radiology procedures and support the creation of educational materials for patients. In our study, ChatGPT provided accurate information and showed no evidence of hallucinations; however, a high level of education and health literacy is required to fully comprehend its responses.

Keywords: ChatGPT; Complication; Interventional radiology; Patient interviewing; Readability.