Exploring a method for extracting concerns of multiple breast cancer patients in the domain of patient narratives using BERT and its optimization by domain adaptation using masked language modeling

Satoshi Watabe; Tomomi Watanabe; Shuntaro Yada; Eiji Aramaki; Hiroshi Yajima; Hayato Kizaki; Satoko Hori

doi:10.1371/journal.pone.0305496

PLoS One. 2024 Sep 6;19(9):e0305496. doi: 10.1371/journal.pone.0305496. eCollection 2024.

Authors

Satoshi Watabe¹, Tomomi Watanabe¹, Shuntaro Yada², Eiji Aramaki², Hiroshi Yajima³, Hayato Kizaki¹, Satoko Hori¹

Affiliations

¹ Division of Drug Informatics, Keio University Faculty of Pharmacy, Tokyo, Japan.
² Nara Institute of Science and Technology, Nara, Japan.
³ Mediaid Corporation, Tokyo, Japan.

Abstract

Narratives posted on the internet by patients contain a vast amount of information about various concerns. This study aimed to extract multiple concerns from interviews with breast cancer patients using the natural language processing (NLP) model bidirectional encoder representations from transformers (BERT). A total of 508 interview transcriptions of breast cancer patients, written in Japanese, were labeled with five types of concern: "treatment," "physical," "psychological," "work/financial," and "family/friends." The labeled texts were used to create a multi-label classifier by fine-tuning a pre-trained BERT model. We also created several classifiers in which fine-tuning was preceded by domain adaptation using (1) breast cancer patients' blog articles and (2) breast cancer patients' interview transcriptions. The performance of the classifiers was evaluated in terms of precision through 5-fold cross-validation. The multi-label classifiers with only fine-tuning achieved precision values of over 0.80 for two of the five concerns, "physical" and "work/financial." In contrast, precision for "treatment" was low, at approximately 0.25. For the classifiers using domain adaptation, however, the precision of this label ranged from 0.40 to 0.51, an improvement of more than 0.2 in some cases. This study showed that combining domain adaptation on target data with a multi-label classifier made it possible to efficiently extract multiple concerns from interviews.
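To make the evaluation described in the abstract concrete, the following is a minimal sketch of how per-label precision could be computed for a five-label multi-label classifier, along with a simple 5-fold split of 508 items. It is not the authors' code. The 0.5 decision threshold, the round-robin fold assignment, the function names, and the toy probabilities are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import Dict, List, Sequence

# The five concern labels named in the abstract.
LABELS = ["treatment", "physical", "psychological",
          "work/financial", "family/friends"]


def binarize(probs: Sequence[Sequence[float]],
             threshold: float = 0.5) -> List[List[int]]:
    """Turn per-label probabilities into 0/1 predictions.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def per_label_precision(y_true: Sequence[Sequence[int]],
                        y_pred: Sequence[Sequence[int]],
                        labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """Precision = TP / (TP + FP), computed independently for each label."""
    result = {}
    for j, name in enumerate(labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 0)
        # Report 0.0 when the label is never predicted (a convention, assumed here).
        result[name] = tp / (tp + fp) if (tp + fp) else 0.0
    return result


def k_fold_indices(n_items: int, k: int = 5) -> List[List[int]]:
    """Round-robin assignment of item indices to k held-out folds.

    Hypothetical splitting scheme: no shuffling or stratification is applied.
    """
    return [list(range(i, n_items, k)) for i in range(k)]


# Toy example (fabricated for illustration only, not study data).
y_true = [[1, 1, 0, 0, 0],
          [0, 1, 1, 0, 0],
          [0, 0, 0, 1, 1],
          [1, 0, 1, 0, 0]]
probs = [[0.9, 0.8, 0.1, 0.2, 0.1],
         [0.7, 0.6, 0.4, 0.1, 0.2],
         [0.2, 0.1, 0.3, 0.9, 0.3],
         [0.3, 0.2, 0.8, 0.1, 0.1]]

precision = per_label_precision(y_true, binarize(probs))
folds = k_fold_indices(508, k=5)
```

In a cross-validation setting, `per_label_precision` would be applied to each held-out fold and the results averaged per label. Reporting each label separately, rather than a single pooled score, is what lets a weak label such as "treatment" stand out from stronger ones.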

Copyright: © 2024 Watabe et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Breast Neoplasms* / psychology
Female
Humans
Narration
Natural Language Processing*

Grants and funding

This work was supported by JSPS KAKENHI Grant Number 21H03170 and JST CREST Grant Number JPMJCR22N1, Japan. For more information on JSPS and JST, please visit https://www.jsps.go.jp and https://www.jst.go.jp, respectively. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.