Evolution of publicly available large language models for complex decision-making in breast cancer care

Sebastian Griewing; Johannes Knitza; Jelena Boekhoff; Christoph Hillen; Fabian Lechner; Uwe Wagner; Markus Wallwiener; Sebastian Kuhn

doi:10.1007/s00404-024-07565-4


Arch Gynecol Obstet. 2024 Jul;310(1):537-550. doi: 10.1007/s00404-024-07565-4. Epub 2024 May 29.

Authors

Sebastian Griewing¹ ² ³, Johannes Knitza⁴, Jelena Boekhoff⁵, Christoph Hillen⁶ ⁷, Fabian Lechner⁸, Uwe Wagner⁵ ⁷, Markus Wallwiener⁹ ⁷, Sebastian Kuhn⁴

Affiliations

¹ Institute for Digital Medicine, Philipps-University Marburg, Marburg, Germany. s.griewing@uni-marburg.de.
² Department of Gynecology and Obstetrics, Philipps-University Marburg, Marburg, Germany. s.griewing@uni-marburg.de.
³ Kommission Digitale Medizin, Deutsche Gesellschaft für Gynäkologie und Geburtshilfe, Berlin, Germany. s.griewing@uni-marburg.de.
⁴ Institute for Digital Medicine, Philipps-University Marburg, Marburg, Germany.
⁵ Department of Gynecology and Obstetrics, Philipps-University Marburg, Marburg, Germany.
⁶ Department of Gynecology and Gynecologic Oncology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany.
⁷ Kommission Digitale Medizin, Deutsche Gesellschaft für Gynäkologie und Geburtshilfe, Berlin, Germany.
⁸ Institute for Artificial Intelligence in Medicine, Philipps-University Marburg, Marburg, Germany.
⁹ Department of Gynecology and Obstetrics, Martin-Luther University Halle-Wittenberg, Halle, Germany.

Abstract

Purpose: This study investigated how closely the treatment recommendations of five publicly available Large Language Models (LLM) agree with those of a multidisciplinary tumor board for complex breast cancer patient profiles.

Methods: Five LLM, including three versions of ChatGPT (version 4 and 3.5, with data access until September 2021 and January 2022), Llama2, and Bard, were prompted to produce treatment recommendations for 20 complex breast cancer patient profiles. LLM recommendations were compared to the recommendations of a multidisciplinary tumor board (gold standard), covering surgical, endocrine, and systemic treatment, radiotherapy, and genetic testing.

Results: GPT4 demonstrated the highest concordance (70.6%) for invasive breast cancer patient profiles, followed by GPT3.5 September 2021 (58.8%), GPT3.5 January 2022 (41.2%), Llama2 (35.3%) and Bard (23.5%). When precancerous lesions (ductal carcinoma in situ) were included, the ranking was identical, with lower overall concordance for each LLM (GPT4 60.0%, GPT3.5 September 2021 50.0%, GPT3.5 January 2022 35.0%, Llama2 30.0%, Bard 20.0%). GPT4 achieved full concordance (100%) for radiotherapy. Alignment was lowest for genetic testing recommendations, with concordance ranging from 55.0% (GPT3.5 January 2022, Llama2 and Bard) to 85.0% (GPT4).
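The reported percentages can be related back to profile counts with a little arithmetic. The sketch below is illustrative only: the abstract states 20 profiles in total but does not state how many were invasive, so the invasive denominator and all per-model counts here are back-calculated from the rounded percentages, not figures reported by the authors. The function names are hypothetical.

```python
# Concordance percentages as reported in the Results paragraph.
INVASIVE = {"GPT4": 70.6, "GPT3.5 Sep 2021": 58.8, "GPT3.5 Jan 2022": 41.2,
            "Llama2": 35.3, "Bard": 23.5}
ALL_PROFILES = {"GPT4": 60.0, "GPT3.5 Sep 2021": 50.0, "GPT3.5 Jan 2022": 35.0,
                "Llama2": 30.0, "Bard": 20.0}
TOTAL_PROFILES = 20  # stated in the Methods paragraph


def consistent_denominators(percentages, max_n):
    """Return every n <= max_n such that each percentage matches k/n
    (to one decimal place) for some integer k."""
    matches = []
    for n in range(1, max_n + 1):
        if all(any(abs(100 * k / n - p) < 0.05 for k in range(n + 1))
               for p in percentages):
            matches.append(n)
    return matches


def implied_count(percentage, n):
    """Number of concordant profiles implied by a rounded percentage of n."""
    return round(percentage * n / 100)


# Inferred, not reported: the only denominator up to 20 that fits all five
# invasive percentages.
invasive_n = consistent_denominators(INVASIVE.values(), TOTAL_PROFILES)
print(invasive_n)  # [17]

invasive_counts = {m: implied_count(p, 17) for m, p in INVASIVE.items()}
overall_counts = {m: implied_count(p, TOTAL_PROFILES)
                  for m, p in ALL_PROFILES.items()}
print(invasive_counts)
print(overall_counts)
```

Under these assumptions the implied concordant counts are the same for the invasive subset and for all 20 profiles, which is consistent with the lower overall percentages arising purely from the larger denominator. This is an inference from rounded values and should be checked against the full paper.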

Conclusion: This early feasibility study is the first to compare different LLM in breast cancer care with regard to changes in accuracy over time, i.e., with access to more data or through technological upgrades. Methodological advancement, i.e., the optimization of prompting techniques, and technological development, i.e., enabling data input control and secure data processing, are necessary in the preparation of large-scale and multicenter studies to provide evidence on their safe and reliable clinical application. At present, safe and evidenced use of LLM in clinical breast cancer care is not yet feasible.

Keywords: Artificial intelligence; Breast cancer; ChatGPT; Large language models; Tumor board.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Breast Neoplasms* / genetics
Breast Neoplasms* / therapy
Clinical Decision-Making
Decision Support Techniques
Female
Humans