Sequence Labeling of Chinese Text Based on Bidirectional GRU-CNN-CRF Model
Authors: Dil Iu, Xinyi Zou
Published in: 2018 15th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2018-12-01
DOI: 10.1109/ICCWAMTIP.2018.8632570 (https://doi.org/10.1109/ICCWAMTIP.2018.8632570)
Citation count: 1
Abstract
Sequence labeling underlies many tasks in natural language processing (NLP), including word segmentation, named entity recognition (NER), and part-of-speech (POS) tagging. The current mainstream approach combines a neural network with a conditional random field (CRF), most commonly as a bidirectional RNN-CRF model, which addresses the difficulty traditional methods have in incorporating context into labeling decisions. This paper proposes a Chinese sequence labeling model based on a bidirectional GRU-CNN-CRF architecture, which pays more attention to local features and contextual relationships and performs better on word segmentation and NER. The model is trained on a corpus drawn from Chinese Wikipedia, with the text preprocessed into word embeddings. The data then pass through a three-tier architecture consisting of a bidirectional Gated Recurrent Unit (GRU), a Convolutional Neural Network (CNN), and a CRF, which completes the sequence labeling task. The method is more accurate than traditional Chinese word segmentation systems, and it outperforms a bidirectional GRU-CRF model on NER.
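To make the idea of word segmentation as sequence labeling more concrete, the following is a minimal sketch of what the final CRF tier does at decoding time. It is not the authors' implementation. The BMES tag scheme (Begin, Middle, End, Single), the hard allowed/forbidden transitions used in place of learned transition scores, and the hand-written emission scores standing in for the output of the bidirectional GRU and CNN tiers are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed tag scheme (not stated in the abstract): B=begin, M=middle, E=end, S=single.
TAGS = ["B", "M", "E", "S"]
NEG_INF = float("-inf")

# Hypothetical hard constraints: which tag may follow which.
# A trained CRF would learn real-valued transition scores instead.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into one BMES label per character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the best-scoring valid tag path for per-character emission scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            best_prev = max(
                TAGS, key=lambda p: score[p] if cur in ALLOWED[p] else NEG_INF
            )
            prev = score[best_prev] if cur in ALLOWED[best_prev] else NEG_INF
            new_score[cur] = prev + em[cur]
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # A sentence can only end with E or S.
    path = [max(("E", "S"), key=lambda t: score[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their BMES labels."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Made-up emission scores for the four characters of "我爱北京".
# Picking the top tag per character would give S, S, B, M, which is invalid
# because a sentence cannot end on M; Viterbi returns S, S, B, E instead.
EXAMPLE_EMISSIONS = [
    {"B": 0.1, "M": 0.0, "E": 0.0, "S": 0.9},
    {"B": 0.2, "M": 0.0, "E": 0.0, "S": 0.8},
    {"B": 0.6, "M": 0.1, "E": 0.0, "S": 0.3},
    {"B": 0.0, "M": 0.5, "E": 0.4, "S": 0.1},
]
```

Calling `tags_to_words("我爱北京", viterbi(EXAMPLE_EMISSIONS))` yields the three words 我, 爱, and 北京. The example shows why a CRF tier is placed on top of the neural encoder: it scores whole tag sequences, so an individually attractive but structurally impossible label can be overruled by its neighbours.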