Internet Newborn Word Recognition Method under Conditional Random Field Model

J. Zhou
Published in: 2018 International Conference on Virtual Reality and Intelligent Systems (ICVRIS), 2018-08-01
DOI: 10.1109/ICVRIS.2018.00136 (https://doi.org/10.1109/ICVRIS.2018.00136)
Citations: 2

Abstract

This paper proposes an approach to the automatic detection of new words. It analyzes web pages collected from the Internet on a large scale to detect candidate new words, then filters the detection results according to morphological rules to extract the new words present in the corpus. The scheme adopts a conditional random field (CRF) and puts forward two improvements. First, in the new word detection stage, an efficient left (right) entropy calculation method is proposed, which improves detection speed and effectively reduces the influence of unrelated characters on the calculation. Second, a quantitative model of missed unlisted (out-of-vocabulary) words is proposed, which extracts repeated strings on the basis of word segmentation and can be used to assess the missed-word problem. The CRF-based combination model is reported to be an effective new word detection method, in terms of both open-test results and the generalization ability of the model.
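
The abstract does not spell out how the left (right) entropy is computed, so the following is only a minimal sketch of the textbook definition of branching entropy, not the accelerated method the paper proposes. The function names `branch_entropy` and `_entropy` and the toy corpus are illustrative assumptions. A candidate string whose left and right neighbours are both varied (high entropy on both sides) is more likely to be a free-standing word.

```python
import math
from collections import Counter
from typing import Tuple


def _entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of a neighbour-character distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def branch_entropy(corpus: str, candidate: str) -> Tuple[float, float]:
    """Return (left_entropy, right_entropy) of `candidate` in `corpus`.

    Hypothetical helper: scans every (possibly overlapping) occurrence
    and tallies the character immediately before and after it.
    """
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0:
            left[corpus[start - 1]] += 1
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)
    return _entropy(left), _entropy(right)


# Toy corpus (an assumption for illustration): "XY" occurs three times,
# preceded by a, c, d and followed by b, b, e.
left_h, right_h = branch_entropy("aXYbcXYbdXYe", "XY")
```

Here the left entropy is higher than the right one, because the three left neighbours are all different while two of the three right neighbours coincide. In a real detector, candidates would be kept only when both values exceed a threshold chosen on held-out data.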