Using Mutual Information Criterion to Design an Effective Lexicon for Chinese Pinyin-to-Character Conversion

Wei Li, Jin-Song Zhang, Yanlu Xie, Xiaoyun Wang, M. Nishida, Seiichi Yamamoto
Published in: 2013 International Conference on Asian Language Processing, 2013-08-17
DOI: 10.1109/IALP.2013.37 (https://doi.org/10.1109/IALP.2013.37)

Abstract

Pinyin-to-character (P2C) conversion is used mainly for entering Chinese characters into a computer. Its main difficulty is homophones, which are resolved by exploiting contextual information supplied by a lexicon and an n-gram language model (LM). Our survey of state-of-the-art P2C techniques shows that conventional optimization methods are almost all based on minimizing text perplexity, which is not directly related to P2C performance. We therefore propose a new optimization criterion, the mutual information (MI) between a text corpus and its Pinyin transcription, and use it to perform self-supervised word segmentation, build a lexicon, and estimate an n-gram LM, from which we then build a P2C system. We implemented the system on a newspaper corpus. Compared with two baseline systems, one using a handcrafted lexicon and one using a lexicon optimized for perplexity, our system achieved relative error reductions of 19.7% and 10.3%, respectively, on the test corpus. These results demonstrate the effectiveness of our proposal.
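To make the criterion concrete, the sketch below estimates the mutual information I(W; Y) = H(W) - H(W | Y) between lexicon units W and their Pinyin Y from token counts. It is only an illustration of the quantity named in the abstract, not the authors' segmentation or lexicon-building procedure, which the abstract does not specify. The function names (`entropy`, `mutual_information`) and the four-token toy corpus are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Empirical entropy in bits of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def mutual_information(pairs):
    """Empirical I(W; Y) in bits from a list of (word, pinyin) tokens."""
    n = len(pairs)
    joint = Counter(pairs)
    w_counts = Counter(w for w, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    mi = 0.0
    for (w, y), c in joint.items():
        # p(w, y) / (p(w) * p(y)) simplifies to c * n / (count(w) * count(y)).
        mi += (c / n) * log2(c * n / (w_counts[w] * y_counts[y]))
    return mi


# Hypothetical toy corpus: "shi" is shared by two words (homophones),
# while "bei jing" maps to a single word.
toy = [("是", "shi"), ("事", "shi"), ("北京", "bei jing"), ("北京", "bei jing")]

h_w = entropy([w for w, _ in toy])   # total uncertainty about the word
mi = mutual_information(toy)         # part of it resolved by the Pinyin
residual = h_w - mi                  # H(W | Y): remaining homophone ambiguity
print(h_w, mi, residual)
```

In this toy setting the residual H(W | Y) is the ambiguity that the Pinyin alone cannot resolve; a lexicon whose units raise I(W; Y) leaves less of that ambiguity for the language model to handle.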