Comparative Assessment of Otolaryngology Knowledge Among Large Language Models

Laryngoscope. 2024 Sep 21. doi: 10.1002/lary.31781. Online ahead of print.

Abstract

Objective: The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open source model from Meta (Llama3:70b) in answering clinical test multiple choice questions in the field of otolaryngology-head and neck surgery.

Methods: A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers.

Results: GPT4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while llama3:70b, GPT3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT4 changed from incorrect to correct answer 31% of the time, while GPT3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.

Conclusion: Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well-suited to serve in roles related to head and neck surgery education provided that the appropriate precautions are taken and potential limitations are understood.

Level of evidence: N/A Laryngoscope, 2024.

Keywords: AI; ENT; artificial intelligence; education; large language models; otolaryngology.