Inter-Rater Reliability of Unstructured Text Labeling: Artificially vs. Naturally Intelligent Approaches

Gleb Danilov; Alexandra Kosyrkova; Maria Shults; Semen Melchenko; Tatyana Tsukanova; Michael Shifrin; Alexander Potapov

doi:10.3233/SHTI210132

Stud Health Technol Inform. 2021 May 27:281:118-122. doi: 10.3233/SHTI210132.

Authors

Gleb Danilov¹, Alexandra Kosyrkova¹, Maria Shults¹, Semen Melchenko¹, Tatyana Tsukanova¹, Michael Shifrin¹, Alexander Potapov¹

Affiliation

¹ Laboratory of Biomedical Informatics and Artificial Intelligence, National Medical Research Center for Neurosurgery named after N.N. Burdenko, Moscow, Russian Federation.

PMID: 34042717
DOI: 10.3233/SHTI210132

Abstract

Demand for unstructured medical text labeling technologies is expected to grow as interest in artificial intelligence and natural language processing rises in the medical domain. Our study aimed to assess the agreement between experts who judged retrospectively, from electronic health records, whether pulmonary embolism (PE) had occurred in neurosurgical cases, and to assess the utility of a machine learning approach for automating this process. We observed moderate agreement between 3 independent raters on PE detection (Light's kappa = 0.568, p = 0). Labeling sentences with the method we proposed earlier might improve machine learning results (accuracy = 0.97, ROC AUC = 0.98), even for cases on which the 3 independent raters could not agree. Medical text labeling techniques might be more efficient when strict rules and semi-automated approaches are implemented. Machine learning might be a good option for unstructured text labeling when the reliability of the textual data is properly addressed. This project was supported by RFBR grant 18-29-22085.
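For readers unfamiliar with the agreement statistic reported above: Light's kappa is commonly defined as the arithmetic mean of Cohen's kappa over all pairs of raters. The following is a minimal sketch of that definition, not the authors' analysis code. The function names `cohen_kappa` and `light_kappa` and the toy ratings are assumptions made for illustration; the ratings are not data from the study.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same cases."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: share of cases where the two labels coincide.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1.0 - p_e)


def light_kappa(ratings):
    """Light's kappa: mean of Cohen's kappa over all rater pairs."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("at least two raters are required")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: 1 = PE judged present, 0 = PE judged absent.
rater_a = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
rater_c = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]

print(round(light_kappa([rater_a, rater_b, rater_c]), 3))
```

Averaging pairwise values keeps the statistic interpretable on the same scale as Cohen's kappa, although it does not by itself show which pair of raters disagrees most; inspecting the individual pairwise values can help with that.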

Keywords: Machine Learning; Natural Language Processing; Neurosurgery; Pulmonary Embolism.

MeSH terms

Artificial Intelligence*
Electronic Health Records
Machine Learning
Natural Language Processing*
Reproducibility of Results
Retrospective Studies