Dagang Chen, Zeyuan Li, Zesong Li, Kunnan Liu, Yajun Song, Peng Wang
International Conference on Artificial Intelligence and Pattern Recognition, published 2019-08-16. DOI: 10.1145/3357254.3357288 (https://doi.org/10.1145/3357254.3357288)
Semi-supervised entity recognition of Chinese government document
Government documents contain a large amount of entity information. Identifying these entities is a core foundation of intelligent document processing tasks such as word segmentation, semantic analysis, and knowledge graph construction. Traditional machine learning algorithms for entity recognition have the advantage of needing only a relatively small annotated corpus. However, this also means that they can hardly capture the implicit semantic information in sentences, which leads to low recognition accuracy, and they require extensive manual feature engineering. In contrast, deep learning algorithms need a large annotated corpus, but they learn semantic features from context automatically, which greatly improves recognition accuracy. Combining the advantages of both approaches, this paper proposes a semi-supervised deep learning framework. It first uses a Conditional Random Field (CRF) with pseudo-labeling to expand the corpus, and then uses a dilated Convolutional Neural Network (CNN) combined with a Bi-directional Long Short-Term Memory (BiLSTM) network and a CRF layer to extract entities from official documents. Experimental results show that, compared with other methods, the accuracy, recall, and F1 score of entity recognition improve by 5.02%, 5.85%, and 5.44%, respectively. The proposed method can effectively extract entity information from documents.
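The corpus-expansion step can be illustrated with a minimal Python sketch of pseudo-labeling. It is not the authors' implementation: the names `expand_corpus` and `train_lookup_tagger`, the confidence threshold, the number of rounds, and the sample sentences are all assumptions made for illustration. In particular, the dictionary-lookup tagger is only a toy stand-in for the CRF named in the abstract, so that the loop runs without external libraries.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sentence = List[str]   # one character (token) per element
Tags = List[str]       # BIO tags aligned with the sentence
Example = Tuple[Sentence, Tags]
Tagger = Callable[[Sentence], Tuple[Tags, float]]  # tags plus a confidence in [0, 1]


def expand_corpus(
    labeled: List[Example],
    unlabeled: Sequence[Sentence],
    train: Callable[[List[Example]], Tagger],
    threshold: float = 0.9,   # assumed value, not taken from the paper
    rounds: int = 3,          # assumed value, not taken from the paper
) -> List[Example]:
    """Grow the labeled corpus with confident predictions on unlabeled text."""
    corpus = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        tagger = train(corpus)              # refit on the current corpus
        remaining = []
        for sentence in pool:
            tags, confidence = tagger(sentence)
            if confidence >= threshold:
                corpus.append((sentence, tags))   # accept as a pseudo-label
            else:
                remaining.append(sentence)        # retry in a later round
        if len(remaining) == len(pool):           # nothing accepted: stop early
            break
        pool = remaining
    return corpus


def train_lookup_tagger(corpus: List[Example]) -> Tagger:
    """Toy stand-in for a CRF: remember the last tag seen for each token."""
    lexicon: Dict[str, str] = {}
    for sentence, tags in corpus:
        for token, tag in zip(sentence, tags):
            lexicon[token] = tag

    def tagger(sentence: Sentence) -> Tuple[Tags, float]:
        tags = [lexicon.get(token, "O") for token in sentence]
        known = sum(token in lexicon for token in sentence)
        # Confidence here is simply the share of tokens seen during training.
        confidence = known / len(sentence) if sentence else 0.0
        return tags, confidence

    return tagger


if __name__ == "__main__":
    labeled = [(list("国务院发布"), ["B-ORG", "I-ORG", "I-ORG", "O", "O"])]
    unlabeled = [list("国务院"), list("教育部发布")]
    for sentence, tags in expand_corpus(labeled, unlabeled, train_lookup_tagger):
        print("".join(sentence), tags)
```

In a fuller pipeline, the `train` argument would fit a real sequence model and the confidence would come from that model's own scores, for example the probability of the best tag sequence. The expanded corpus would then be used to train the final tagger.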