Vocabulary enhancement in Chinese-named entity recognition
Authors: Lichen Xu, Xue-feng Fu, Yuehua Wu, Qian-Hui Gu
Venue: 2022 5th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE)
Publication date: 2022-04-01
DOI: 10.1109/AEMCSE55572.2022.00119 (https://doi.org/10.1109/AEMCSE55572.2022.00119)
Citation count: 0
Abstract
In traditional Chinese named entity recognition (NER) systems, word-based sequence labeling models depend on the quality of word segmentation, so segmentation mistakes readily cause entity boundary detection errors. Character-based sequence labeling models avoid this error propagation, but because they learn only from character-level signals, they lose much of the lexical information. The result is blurred entity boundaries and weaker recognition performance. To address the difficulty of delimiting the boundaries of Chinese named entities, a vocabulary enhancement model is proposed. First, the model builds on a character-based sequence labeling model to avoid error propagation from Chinese word segmentation. Second, it integrates an external lexicon to add lexical information and sharpen entity boundaries. Finally, it introduces the ERNIE pre-trained language model to supplement latent vocabulary features and improve the capture of contextual information for words. As a result, the model has strong semantic awareness and significantly improves Chinese named entity recognition on each classical data set.
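The abstract does not say how lexicon words are attached to individual characters. As an illustration only, the sketch below assumes a simple B/M/E/S matching scheme, in which each character collects the lexicon words it begins, sits inside, ends, or forms on its own. The function name `lexicon_features`, the `max_word_len` parameter, and the toy lexicon are hypothetical and are not taken from the paper; the ERNIE encoder and the sequence labeling layer are left out.

```python
def lexicon_features(sentence, lexicon, max_word_len=4):
    """Collect, for every character, the lexicon words that cover it.

    Assumed scheme (not confirmed by the paper): each character gets four
    sets keyed by its position in the matched word:
    B = begins the word, M = inside it, E = ends it, S = single-character word.
    """
    features = [{"B": set(), "M": set(), "E": set(), "S": set()}
                for _ in sentence]
    n = len(sentence)
    for start in range(n):
        # Try every substring starting here, up to max_word_len characters.
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                features[start]["S"].add(word)
                continue
            features[start]["B"].add(word)
            features[end - 1]["E"].add(word)
            for mid in range(start + 1, end - 1):
                features[mid]["M"].add(word)
    return features


if __name__ == "__main__":
    # Toy lexicon, chosen only to show overlapping matches.
    toy_lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
    for char, feats in zip("南京市长江大桥",
                           lexicon_features("南京市长江大桥", toy_lexicon)):
        print(char, {k: sorted(v) for k, v in feats.items() if v})
```

In a full model, these per-character word sets would typically be turned into embeddings and combined with the character representation before sequence labeling. That step is also an assumption here, since the abstract gives no such detail.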