Cost-aware active learning for named entity recognition in clinical text

J Am Med Inform Assoc. 2019 Nov 1;26(11):1314-1322. doi: 10.1093/jamia/ocz102.

Abstract

Objective: Active Learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost for annotating each sample is identical. This study introduces a cost-aware AL method, which simultaneously models both the annotation cost and the informativeness of the samples and evaluates both via simulation and user studies.

Materials and methods: We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities; we first utilized lexical and syntactic features to estimate annotation cost, then we incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we then conducted a simulation study comparing Cost-CAUSE with noncost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning.

Results: Our cost model fit empirical annotation data well, and Cost-CAUSE increased the simulation area under the learning curve (ALC) scores by up to 5.6% and 4.9%, compared with random sampling and alternate AL methods. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%-30.2%.

Discussion: Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors have a noticeable effect on the AL method, such as the annotation accuracy of users, the tiredness of users, and even the physical and mental condition of users.

Conclusion: Cost-CAUSE saves significant annotation cost compared to random sampling.

Keywords: active learning; electronic health records; named entity recognition, user study; natural language processing.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Big Data
  • Computer Simulation
  • Electronic Health Records / economics*
  • Humans
  • Information Storage and Retrieval / economics*
  • Models, Economic
  • Natural Language Processing*