中文未知词提取统计方法的直方图均衡化

Bor-Shen Lin, Yi-Cong Chen
{"title":"中文未知词提取统计方法的直方图均衡化","authors":"Bor-Shen Lin, Yi-Cong Chen","doi":"10.30019/IJCLCLP.201112.0003","DOIUrl":null,"url":null,"abstract":"With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. Therefore, it is important for natural language processing systems to extract new words in progression with time. Due to the broad areas of applications, however, there might exist the mismatch of statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a scheme of word extraction in which histogram equalization for feature normalization is used. Through this scheme, the mismatch of the feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately such that unknown word extraction becomes more reliable and applicable to novice domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. 68.43% and 71.40% F-measures for word identification, which correspond to 66.72%/32.94% and 75.99%/58.39% recall rates for IV/OOV, respectively, were achieved for the CKIP and the CUHK test sets, respectively, using four combined features with equalization. When applied to unknown word extraction for a novice domain, this scheme can identify such pronouns as ”海角七號” (Cape No. 7, the name of a film), ”蠟筆小新” (Crayon Shinchan, the name of a cartoon figure), ”金融海嘯” (Financial Tsunami) and so on, which cannot be extracted reliably with rule-based approaches, although the approach appears not so good at identifying such terms as the names of humans, places, or organizations, for which the semantic structure is prominent. This scheme is complementary with the outcomes of two word segmentation systems, and is promising if other rule-based approaches could be further integrated.","PeriodicalId":436300,"journal":{"name":"Int. J. Comput. Linguistics Chin. Lang. Process.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Histogram Equalization on Statistical Approaches for Chinese Unknown Word Extraction\",\"authors\":\"Bor-Shen Lin, Yi-Cong Chen\",\"doi\":\"10.30019/IJCLCLP.201112.0003\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. Therefore, it is important for natural language processing systems to extract new words in progression with time. Due to the broad areas of applications, however, there might exist the mismatch of statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a scheme of word extraction in which histogram equalization for feature normalization is used. Through this scheme, the mismatch of the feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately such that unknown word extraction becomes more reliable and applicable to novice domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. 68.43% and 71.40% F-measures for word identification, which correspond to 66.72%/32.94% and 75.99%/58.39% recall rates for IV/OOV, respectively, were achieved for the CKIP and the CUHK test sets, respectively, using four combined features with equalization. When applied to unknown word extraction for a novice domain, this scheme can identify such pronouns as ”海角七號” (Cape No. 7, the name of a film), ”蠟筆小新” (Crayon Shinchan, the name of a cartoon figure), ”金融海嘯” (Financial Tsunami) and so on, which cannot be extracted reliably with rule-based approaches, although the approach appears not so good at identifying such terms as the names of humans, places, or organizations, for which the semantic structure is prominent. This scheme is complementary with the outcomes of two word segmentation systems, and is promising if other rule-based approaches could be further integrated.\",\"PeriodicalId\":436300,\"journal\":{\"name\":\"Int. J. Comput. Linguistics Chin. Lang. Process.\",\"volume\":\"14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Int. J. Comput. Linguistics Chin. Lang. Process.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.30019/IJCLCLP.201112.0003\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Int. J. Comput. Linguistics Chin. Lang. Process.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30019/IJCLCLP.201112.0003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

随着人类生活的进化和信息的传播,新事物层出不穷,每天都有新的术语产生。因此,对自然语言处理系统来说,随着时间的推移提取新词是很重要的。然而,由于应用领域广泛,训练域和测试域之间可能存在统计特征不匹配的问题,这将不可避免地降低词提取的性能。本文提出了一种利用直方图均衡化进行特征归一化的词提取方案。通过该方案,可以适当补偿因语料库大小或领域变化而导致的特征分布不匹配,使未知词提取更加可靠,适用于新手领域。该方案最初在sigan2中公布的语料库上进行了评估。使用均衡化的4个组合特征,CKIP和CUHK测试集的单词识别f测度分别达到68.43%和71.40%,对应于IV/OOV的召回率分别为66.72%/32.94%和75.99%/58.39%。当应用于未知词提取对于新手领域,这个方案可以识别等代词“海角七號”(海角七号,一个电影的名字),“蠟筆小新”(蜡笔Shinchan,一个卡通人物的名字),“金融海嘯”(金融海啸)等等,不能提取的可靠与基于规则的方法,尽管这种方法似乎不太擅长识别人类的名字,等方面的地方,或组织,语义结构突出。该方案与两种分词系统的结果是互补的,如果其他基于规则的方法可以进一步集成,则前景广阔。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Histogram Equalization on Statistical Approaches for Chinese Unknown Word Extraction
With the evolution of human lives and the spread of information, new things emerge quickly and new terms are created every day. Therefore, it is important for natural language processing systems to extract new words in progression with time. Due to the broad areas of applications, however, there might exist the mismatch of statistical characteristics between the training domain and the testing domain, which inevitably degrades the performance of word extraction. This paper proposes a scheme of word extraction in which histogram equalization for feature normalization is used. Through this scheme, the mismatch of the feature distributions due to different corpus sizes or changes of domain can be compensated for appropriately such that unknown word extraction becomes more reliable and applicable to novice domains. The scheme was initially evaluated on the corpora announced in SIGHAN2. 68.43% and 71.40% F-measures for word identification, which correspond to 66.72%/32.94% and 75.99%/58.39% recall rates for IV/OOV, respectively, were achieved for the CKIP and the CUHK test sets, respectively, using four combined features with equalization. When applied to unknown word extraction for a novice domain, this scheme can identify such pronouns as ”海角七號” (Cape No. 7, the name of a film), ”蠟筆小新” (Crayon Shinchan, the name of a cartoon figure), ”金融海嘯” (Financial Tsunami) and so on, which cannot be extracted reliably with rule-based approaches, although the approach appears not so good at identifying such terms as the names of humans, places, or organizations, for which the semantic structure is prominent. This scheme is complementary with the outcomes of two word segmentation systems, and is promising if other rule-based approaches could be further integrated.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信