Use of Generative AI to Identify Helmet Status Among Patients With Micromobility-Related Injuries From Unstructured Clinical Notes

JAMA Netw Open. 2024 Aug 1;7(8):e2425981. doi: 10.1001/jamanetworkopen.2024.25981.

Abstract

Importance: Large language models (LLMs) have potential to increase the efficiency of information extraction from unstructured clinical notes in electronic medical records.

Objective: To assess the utility and reliability of an LLM, ChatGPT-4 (OpenAI), to analyze clinical narratives and identify helmet use status of patients injured in micromobility-related accidents.
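
As a rough illustration of this classification task (not the authors' exact workflow, which the abstract does not specify), the sketch below queries a GPT-4 model through the OpenAI Python client; the prompt wording, model name, and helper function are assumptions.

    # Minimal sketch of the helmet-status classification task, assuming
    # programmatic access via the OpenAI Python client; the prompt text and
    # model name are illustrative, not the study's actual prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_helmet_status(narrative: str) -> str:
        """Label one clinical narrative (hypothetical helper)."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # reduce run-to-run variation
            messages=[
                {"role": "system",
                 "content": ("Decide whether the injured patient was wearing "
                             "a helmet. Answer with exactly one word: "
                             "WEARING, NOT_WEARING, or UNKNOWN.")},
                {"role": "user", "content": narrative},
            ],
        )
        return response.choices[0].message.content.strip()

    print(classify_helmet_status(
        "22YOM FELL OFF POWERED SCOOTER, NO HELMET. DX: CLOSED HEAD INJURY"))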

Design, setting, and participants: This cross-sectional study used publicly available, deidentified 2019 to 2022 data from the US Consumer Product Safety Commission's National Electronic Injury Surveillance System, a nationally representative stratified probability sample of 96 hospitals in the US. Unweighted estimates of e-bike-, bicycle-, hoverboard-, and powered scooter-related injuries that resulted in an emergency department visit were used. Statistical analysis was performed from November 2023 to April 2024.
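
A minimal sketch of assembling such a sample from the NEISS public-use files, assuming pandas; the file name, column names, and product codes below are placeholders, not the study's actual values.

    # Hypothetical NEISS loading step; consult the NEISS coding manual for
    # the real product codes for bicycles, e-bikes, hoverboards, and
    # powered scooters.
    import pandas as pd

    neiss = pd.read_csv("neiss_2019_2022.tsv", sep="\t", dtype=str)

    MICROMOBILITY_CODES = {"0000", "1111"}  # placeholder product codes

    mask = neiss["Product_1"].isin(MICROMOBILITY_CODES)
    notes = neiss.loc[mask, "Narrative_1"].tolist()
    print(f"{len(notes)} candidate narratives")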

Main outcomes and measures: Patient helmet status (wearing vs not wearing vs unknown) was extracted from clinical narratives using (1) a text string search based on researcher-generated search strings and (2) the LLM, prompted with low-, intermediate-, and high-detail prompts. The level of agreement between the 2 approaches across all 3 prompts was analyzed using Cohen κ test statistics. Fleiss κ was calculated to measure the test-retest reliability of the high-detail prompt across 5 new chat sessions and days. Performance statistics were calculated by comparing results from the high-detail prompt to classifications of helmet status generated by researchers reading the clinical notes (ie, a criterion standard review).
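
The abstract does not list the researcher-generated search strings, so the sketch below uses assumed patterns for the rule-based arm and scikit-learn's cohen_kappa_score for the agreement statistic; the sample narratives and LLM labels are stand-ins.

    # Illustrative text string search plus Cohen kappa agreement check.
    import re
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical search patterns; the study's actual strings may differ.
    NOT_WEARING = re.compile(r"NO HELMET|NOT WEARING|UNHELMETED|W/?O\s*HELMET")
    WEARING = re.compile(r"HELMETED|WEARING|W/?\s+HELMET")

    def string_search_status(narrative: str) -> str:
        """Rule-based helmet status for one narrative (illustrative)."""
        text = narrative.upper()
        if NOT_WEARING.search(text):  # check negations before affirmations
            return "NOT_WEARING"
        if WEARING.search(text):
            return "WEARING"
        return "UNKNOWN"

    # Stand-in data: three narratives with labels from the LLM arm.
    notes = [
        "45YOF FELL OFF BICYCLE, HELMETED. DX: WRIST FX",
        "22YOM POWERED SCOOTER CRASH, NO HELMET. DX: HEAD INJURY",
        "30YOM FELL FROM HOVERBOARD. DX: ANKLE SPRAIN",
    ]
    llm_labels = ["WEARING", "NOT_WEARING", "UNKNOWN"]

    string_labels = [string_search_status(n) for n in notes]
    print(f"Cohen kappa = {cohen_kappa_score(string_labels, llm_labels):.2f}")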

Results: Among 54 569 clinical notes, moderate (Cohen κ = 0.74 [95% CI, 0.73-0.75]) and weak (Cohen κ = 0.53 [95% CI, 0.52-0.54]) agreement were found between the text string-search approach and the LLM for the low- and intermediate-detail prompts, respectively. The high-detail prompt had almost perfect agreement (κ = 1.00 [95% CI, 1.00-1.00]) but required the greatest amount of time to complete. The LLM did not perfectly replicate its analyses across new sessions and days (Fleiss κ = 0.91 across 5 trials; P < .001). The LLM often hallucinated and was consistent in replicating its hallucinations. It also showed high validity compared with the criterion standard (n = 400; κ = 0.98 [95% CI, 0.96-1.00]).
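
A minimal sketch of the test-retest computation, assuming statsmodels for Fleiss κ (the abstract does not name the authors' software) and synthetic labels standing in for the 5 repeated chat sessions.

    # Fleiss kappa across 5 repeated LLM runs; data here are synthetic.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    rng = np.random.default_rng(0)
    base = rng.integers(0, 3, size=100)  # one label per note, 3 categories
    # 5 runs that agree with the base label ~95% of the time.
    trial_labels = np.column_stack([
        np.where(rng.random(100) < 0.95, base, rng.integers(0, 3, size=100))
        for _ in range(5)
    ])

    counts, _ = aggregate_raters(trial_labels)  # notes x categories table
    print(f"Fleiss kappa = {fleiss_kappa(counts):.2f}")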

Conclusions and relevance: This study's findings suggest that although using the LLM to extract information from clinical notes offers efficiency gains, its inadequate reliability compared with a text string-search approach, its hallucinations, and its inconsistent performance substantially limit the potential of the currently available LLM.

MeSH terms

  • Accidents, Traffic / statistics & numerical data
  • Adolescent
  • Adult
  • Cross-Sectional Studies
  • Electronic Health Records / statistics & numerical data
  • Female
  • Head Protective Devices* / statistics & numerical data
  • Humans
  • Male
  • Middle Aged
  • Reproducibility of Results
  • United States / epidemiology
  • Young Adult