Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare

AMIA Jt Summits Transl Sci Proc. 2024 May 31;2024:75-84. eCollection 2024.

Authors

Prosanta Barai¹, Gondy Leroy¹, Prakash Bisht¹, Joshua M Rothman², Sumi Lee¹, Jennifer Andrews¹, Sydney A Rice¹, Arif Ahmed¹

Affiliations

¹ The University of Arizona, Tucson 85721, U.S.A.
² UC San Diego Division of Academic General Pediatrics, USA.

PMID: 38827063
PMCID: PMC11141838

Abstract

Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is limited by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time-, and post-data gathering stages. We evaluated the effect of this data quality enhancement through its impact on an LLM (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19% compared to pre-quality control. Fine-tuning Bio-BERT using crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlight the potential of crowdsourcing and quality control in resource-constrained environments and offer insights into optimizing healthcare LLMs for informed decision-making and improved patient care.
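
The recall and precision trade-off reported above can be illustrated with a minimal sketch. The labels and predictions below are hypothetical toy values chosen only for illustration, not data or results from the study, and the function name `precision_recall` is an assumption of this example. It shows how a model that flags more sentences as symptoms can gain recall while losing precision.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks a symptom."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 4 symptom sentences followed by 4 non-symptom sentences.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical "baseline" predictions: conservative, misses half the symptoms.
baseline_pred = [1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical "fine-tuned" predictions: finds every symptom, adds false alarms.
finetuned_pred = [1, 1, 1, 1, 1, 1, 0, 0]

baseline = precision_recall(y_true, baseline_pred)
finetuned = precision_recall(y_true, finetuned_pred)

print("baseline   precision=%.2f recall=%.2f" % baseline)
print("fine-tuned precision=%.2f recall=%.2f" % finetuned)
```

In this toy case the second set of predictions recovers all four symptom sentences (higher recall) at the cost of two false positives (lower precision), which is the general pattern the abstract describes.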