Semi-supervised entity recognition of Chinese government documents

Dagang Chen, Zeyuan Li, Zesong Li, Kunnan Liu, Yajun Song, Peng Wang
International Conference on Artificial Intelligence and Pattern Recognition, published 2019-08-16
DOI: 10.1145/3357254.3357288 (https://doi.org/10.1145/3357254.3357288)
Citations: 3

Abstract

Government documents contain a large amount of entity information. Identifying these entities is the foundation of intelligent document processing tasks such as word segmentation, semantic analysis, and knowledge graph construction. For entity recognition, traditional machine learning algorithms have the advantage of requiring a relatively small tagged corpus. However, this also means that they can hardly capture the implicit semantic information in sentences, which leads to low entity recognition accuracy, and they require a tremendous amount of manual feature design. In contrast, deep learning algorithms need a large tagged corpus, but they can automatically acquire semantic features from context, which greatly improves recognition accuracy. Combining the respective advantages of these two approaches, this paper proposes a semi-supervised deep learning framework. It first uses a Conditional Random Field (CRF) with pseudo-labeling to expand the corpus, and then uses a dilated Convolutional Neural Network (CNN) with Bi-directional Long Short-Term Memory (BiLSTM) plus a CRF to extract entities from official documents. Experimental results show that, compared with other methods, the accuracy, recall, and F1 score of entity recognition improve by 5.02%, 5.85%, and 5.44%, respectively. The proposed method can effectively extract entity information from a document.
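The corpus-expansion step can be illustrated with a minimal pseudo-labeling sketch. The abstract does not say how pseudo-labels are selected, so the sentence-level confidence threshold used here is an assumption, and `expand_corpus`, `toy_tagger`, and `threshold` are hypothetical names. The toy tagger is only a stand-in for a trained CRF so that the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

Sentence = List[str]  # one token per element
Tags = List[str]      # one BIO tag per token
# A tagger returns predicted tags and a confidence in [0, 1] (assumed interface).
Tagger = Callable[[Sentence], Tuple[Tags, float]]


def expand_corpus(
    labeled: List[Tuple[Sentence, Tags]],
    unlabeled: Sequence[Sentence],
    tagger: Tagger,
    threshold: float = 0.9,
) -> List[Tuple[Sentence, Tags]]:
    """Return the labeled corpus plus confidently pseudo-labeled sentences."""
    expanded = list(labeled)  # copy, so the original corpus is not mutated
    for sentence in unlabeled:
        tags, confidence = tagger(sentence)
        # Assumed selection rule: keep only predictions above the threshold.
        if confidence >= threshold:
            expanded.append((sentence, tags))
    return expanded


def toy_tagger(sentence: Sentence) -> Tuple[Tags, float]:
    """Hypothetical stand-in for a trained CRF: gazetteer lookup."""
    gazetteer = {"国务院": "ORG", "北京市": "LOC"}
    known = {"国务院", "北京市", "发布", "通知"}
    tags = ["B-" + gazetteer[tok] if tok in gazetteer else "O" for tok in sentence]
    if not sentence:
        return tags, 0.0
    # Toy confidence: the fraction of tokens the tagger has seen before.
    confidence = sum(tok in known for tok in sentence) / len(sentence)
    return tags, confidence


if __name__ == "__main__":
    labeled = [(["国务院", "发布", "通知"], ["B-ORG", "O", "O"])]
    unlabeled = [["北京市", "发布", "通知"], ["某某局", "研究", "方案"]]
    for sentence, tags in expand_corpus(labeled, unlabeled, toy_tagger):
        print(sentence, tags)
```

In a pipeline of the kind the abstract describes, the expanded corpus would then serve as training data for the dilated CNN, BiLSTM, and CRF model. A stricter threshold adds fewer but cleaner pseudo-labeled sentences.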