Semi-supervised entity recognition of Chinese government document

International Conference on Artificial Intelligence and Pattern Recognition. Pub Date: 2019-08-16. DOI: 10.1145/3357254.3357288

Dagang Chen, Zeyuan Li, Zesong Li, Kunnan Liu, Yajun Song, Peng Wang

Citations: 3

Abstract

Government documents contain a large amount of entity information. Identifying this information is the core foundation of intelligent document processing tasks such as word segmentation, semantic analysis, and knowledge graph construction. For entity recognition, traditional machine learning algorithms have the advantage of requiring a relatively small annotated corpus. However, this also means that they can hardly capture the implicit semantic information in sentences, which leads to low accuracy in document entity recognition, and they require a great deal of manual feature design. In contrast, deep learning algorithms need a large annotated corpus, but they can automatically acquire semantic features from context, which greatly improves the accuracy of entity recognition. Combining the respective advantages of these methods, this paper proposes a semi-supervised deep learning framework that first applies a Conditional Random Field (CRF) with pseudo-labeling to expand the corpus, and then uses a dilated Convolutional Neural Network (CNN) with Bi-directional Long Short-Term Memory (BiLSTM) plus a CRF to extract entities from official documents. The experimental results show that, compared with other methods, the accuracy, recall, and F1 score of entity recognition improve by 5.02%, 5.85%, and 5.44%, respectively. The proposed method can effectively extract entity information from documents.
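
The corpus-expansion step mentioned in the abstract (train a tagger on the labeled data, tag unlabeled sentences, keep only the confident predictions as new training examples) might look roughly like the sketch below. It is an illustration only, not the authors' implementation: `FrequencyTagger` is a toy stand-in for the CRF, and the 0.9 confidence threshold, the BIO tag set, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter, defaultdict


class FrequencyTagger:
    """Toy stand-in for the CRF (assumption): each character gets its most
    frequent training label. A real system would use a trained CRF here."""

    def fit(self, sentences):
        counts = defaultdict(Counter)
        for tokens, tags in sentences:
            for token, tag in zip(tokens, tags):
                counts[token][tag] += 1
        self.counts = counts
        return self

    def predict(self, tokens):
        """Return (tags, confidence); confidence is the weakest per-token vote share."""
        tags, confidence = [], 1.0
        for token in tokens:
            votes = self.counts.get(token)
            if not votes:
                tags.append("O")
                confidence = 0.0  # unseen character: do not trust this sentence
                continue
            tag, n = votes.most_common(1)[0]
            tags.append(tag)
            confidence = min(confidence, n / sum(votes.values()))
        return tags, confidence


def pseudo_label(labeled, unlabeled, threshold=0.9):
    """Train on `labeled`, then add confidently tagged unlabeled sentences
    to the corpus. The threshold value is a hypothetical choice."""
    tagger = FrequencyTagger().fit(labeled)
    expanded, remaining = list(labeled), []
    for tokens in unlabeled:
        tags, confidence = tagger.predict(tokens)
        if confidence >= threshold:
            expanded.append((tokens, tags))  # accepted as a pseudo-labeled example
        else:
            remaining.append(tokens)  # left unlabeled
    return expanded, remaining


# Hypothetical character-level sample data with BIO tags.
labeled = [
    (list("国务院发布通知"), ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]),
    (list("教育部发布通知"), ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]),
]
unlabeled = [list("国务院通知"), list("财政部通知")]

expanded, remaining = pseudo_label(labeled, unlabeled)
print(len(expanded), len(remaining))
```

In this toy run the first unlabeled sentence contains only characters seen during training, so it is accepted into the expanded corpus, while the second contains unseen characters and stays unlabeled. In the framework the abstract describes, the expanded corpus would then be used to train the dilated CNN, BiLSTM, and CRF model.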
