New Words Discovery Method Based On Word Segmentation Result

Heyang Liu, Pengdong Gao, Yi Xiao
DOI: 10.1109/ICIS.2018.8466490
Published in: 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS), June 2018
Citations: 6

Abstract

This paper presents a new-word discovery method based on word segmentation results. Word segmentation is an important step in many Chinese natural language processing (NLP) tasks, and improving its accuracy is a matter of great concern. As the volume of web text grows, more and more Chinese NLP tasks draw on micro-blogs, movie reviews, and other web text. Such content changes rapidly and often contains a large number of new words, and the inability of segmentation tools to recognize these words is a major factor degrading segmentation accuracy. One way to address this problem is to discover the new words in the target text and add them to the dictionaries on which the segmentation tool depends. Traditional new-word discovery methods can only find words absent from the segmentation tool's dictionary, but such words do not necessarily affect the segmentation result: a word may be segmented correctly even without being added to the dictionary. To address this issue, we propose building the collection of candidate new words from the segmentation result itself. Every new word discovered this way is one that the segmentation tool has segmented incorrectly, so adding these words to the tool's dictionary improves segmentation accuracy more than traditional methods do. Experiments on the DouBan movie review dataset show that our method finds better new words and improves accuracy on movie review sentiment classification.
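The abstract does not spell out how candidates are scored, but the core idea of building candidates from segmentation output can be sketched as follows. In this minimal sketch (an assumption, not the authors' exact algorithm), adjacent tokens in the segmenter's output are treated as candidate new words, since a genuinely new word will have been split into pieces that co-occur far more often than chance; pointwise mutual information (PMI) is one common way to measure that. The function name, thresholds, and toy data below are all illustrative.

```python
import math
from collections import Counter

def candidate_new_words(segmented_sentences, min_count=2, pmi_threshold=1.0):
    """Collect adjacent-token pairs from segmentation output as candidate
    new words, keeping pairs whose parts co-occur far more often than
    chance (high pointwise mutual information).

    segmented_sentences: list of token lists, i.e. the segmenter's output.
    Returns a dict mapping each candidate (the two tokens joined) to its PMI.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in segmented_sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    candidates = {}
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue  # too rare to trust the statistics
        p_ab = n / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        pmi = math.log(p_ab / (p_a * p_b))
        if pmi >= pmi_threshold:
            candidates[a + b] = pmi
    return candidates

# Toy segmentation output in which the (hypothetical) new word "新词"
# has been split into "新" + "词" by the segmenter:
sents = [["新", "词", "发现"], ["新", "词", "很", "多"], ["新", "词", "识别"]]
print(candidate_new_words(sents))  # "新词" is recovered as a candidate
```

Because every candidate is assembled from pieces the segmenter actually produced, any word found this way is by construction one the tool mis-segmented, which matches the paper's argument for why these candidates are more useful additions to the dictionary than OOV words found by traditional methods.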