A deep active learning-based and crowdsourcing-assisted solution for named entity recognition in Chinese historical corpora

Aslib J. Inf. Manag. Pub Date : 2022-12-13 DOI:10.1108/ajim-03-2022-0107

Chengxi Yan, Xuemei Tang, Haoxia Yang, Jun Wang

Citations: 3

Abstract

Purpose: Most existing studies of named entity recognition (NER) concentrate on improving the predictions of deep neural network (DNN)-based models themselves, while the scarcity of training corpora and the difficulty of annotation quality control remain not fully solved, especially for ancient Chinese corpora. Designing a new integrated solution for Chinese historical NER, including automatic entity extraction and human-machine cooperative annotation, is therefore valuable for improving the effectiveness of Chinese historical NER and fostering the development of low-resource information extraction.

Design/methodology/approach: The research provides a systematic approach for Chinese historical NER with a three-stage framework. In addition to a basic preprocessing stage, the authors create, retrain and yield a high-performance NER model using only limited labeled resources during the stage of augmented deep active learning (ADAL), which entails three steps: DNN-based NER modeling, hybrid pool-based sampling (HPS) based on active learning (AL), and NER-oriented data augmentation (DA). ADAL is thought to be able to keep the performance of the DNN as high as possible under the few-shot constraint. Then, to realize machine-aided quality control in crowdsourcing settings, the authors design a stage of globally-optimized automatic label consolidation (GALC). The core of GALC is a newly designed label consolidation model called simulated annealing-based automatic label aggregation ("SA-ALC"), which incorporates the factors of worker reliability and global label estimation. The model can assure the annotation quality of data from a crowdsourcing annotation system.

Findings: Extensive experiments on two types of Chinese classical historical datasets show that the authors' solution can effectively reduce the corpus dependency of a DNN-based NER model and alleviate the problem of label quality. Moreover, the results show the superior performance of the authors' pipeline approaches (i.e. HPS + DA and SA-ALC) compared to equivalent baselines in each stage.

Originality/value: The study sheds new light on the automatic extraction of Chinese historical entities through an integration of the whole technological process. The solution helps to reduce annotation cost and control labeling quality for the NER task. It can be further applied to similar information extraction tasks and other low-resource fields, both theoretically and practically.
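
The abstract does not say how hybrid pool-based sampling scores candidate sentences. As a rough illustration of pool-based active learning in general, the Python sketch below ranks unlabeled sentences by a least-confidence criterion, which is a common choice and only an assumption here, not the authors' HPS. The function names `sentence_uncertainty` and `select_batch` and the toy tag probabilities are hypothetical.

```python
from typing import Dict, List, Sequence


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Least-confidence score for one sentence.

    token_probs holds, for each token, the model's probability
    distribution over tags. The score is 1 minus the mean of the
    highest tag probability per token, so higher means less certain.
    (Assumed criterion; the abstract does not specify one.)
    """
    if not token_probs:
        return 0.0
    best = [max(p) for p in token_probs]
    return 1.0 - sum(best) / len(best)


def select_batch(pool: Dict[str, Sequence[Sequence[float]]], k: int) -> List[str]:
    """Return the ids of the k most uncertain sentences in the unlabeled pool."""
    ranked = sorted(
        pool,
        key=lambda sid: sentence_uncertainty(pool[sid]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy pool: sentence id -> per-token tag distributions (made-up values).
    pool = {
        "a": [[0.9, 0.1], [0.8, 0.2]],
        "b": [[0.5, 0.5], [0.6, 0.4]],
        "c": [[0.99, 0.01]],
    }
    # The selected sentences would be sent to annotators, then added to
    # the training set before the NER model is retrained.
    print(select_batch(pool, 2))  # ['b', 'a']
```

In a loop of this kind, each round retrains the model on the enlarged labeled set and rescores the remaining pool; a hybrid strategy would presumably combine more than one such score, but the abstract gives no details.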
