Using Mutual Information Criterion to Design an Effective Lexicon for Chinese Pinyin-to-Character Conversion

2013 International Conference on Asian Language Processing Pub Date : 2013-08-17 DOI:10.1109/IALP.2013.37

Wei Li, Jin-Song Zhang, Yanlu Xie, Xiaoyun Wang, M. Nishida, Seiichi Yamamoto



Abstract

Pinyin-to-character (P2C) conversion is mainly used to input Chinese characters into a computer. Its main difficulty is homophones, which are disambiguated using contextual information provided by a lexicon and an n-gram language model (LM). Our survey of state-of-the-art P2C technologies shows that conventional methods for optimizing the lexicon and LM are almost all based on minimizing text perplexity, which is not directly related to P2C performance. We therefore propose a new optimization criterion, the mutual information (MI) between a text corpus and its Pinyin transcription, and use it to perform self-supervised word segmentation, build a lexicon, and estimate an n-gram LM, from which a P2C system is then built. We implemented the P2C system using a newspaper corpus. Compared with two baseline systems, one using a handcrafted lexicon and one using a perplexity-optimized lexicon, our system achieved relative error reductions of 19.7% and 10.3%, respectively, on the test corpus. The results show the effectiveness of our proposal.
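To make the criterion named in the abstract concrete, the following is a minimal sketch of how an empirical mutual information between word tokens and their Pinyin could be computed. It is not the authors' formulation, which the abstract does not spell out: the function names, the toy token list, and the use of toneless Pinyin are assumptions made for illustration only.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical I(W; P) in bits from a list of (word, pinyin) tokens."""
    n = len(pairs)
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    pinyins = Counter(p for _, p in pairs)
    mi = 0.0
    for (w, p), c in joint.items():
        # p(w, p) * log2( p(w, p) / (p(w) * p(p)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (words[w] * pinyins[p]))
    return mi


def conditional_entropy(pairs):
    """Empirical H(W | P) in bits: uncertainty left about the word once its Pinyin is known."""
    n = len(pairs)
    joint = Counter(pairs)
    pinyins = Counter(p for _, p in pairs)
    return -sum((c / n) * math.log2(c / pinyins[p]) for (_, p), c in joint.items())


def entropy(items):
    """Empirical entropy in bits of a list of symbols."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


# Hypothetical toy corpus: three homophones share the Pinyin "shi".
tokens = [
    ("是", "shi"), ("事", "shi"), ("是", "shi"), ("市", "shi"),
    ("人", "ren"), ("人", "ren"), ("大", "da"), ("大", "da"),
]

mi = mutual_information(tokens)
h_w_given_p = conditional_entropy(tokens)
h_w = entropy([w for w, _ in tokens])
print(f"I(W;P) = {mi:.3f} bits, H(W|P) = {h_w_given_p:.3f} bits, H(W) = {h_w:.3f} bits")
```

In this sketch H(W) = I(W;P) + H(W|P), so for a given lexicon a larger share of I(W;P) means less homophone ambiguity left for the language model to resolve. That is one way to read why an MI-based criterion might track P2C accuracy more closely than text perplexity, which never looks at the Pinyin side.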

