Error feedback based lexical entity extraction for Chinese language modeling

2013 6th International Congress on Image and Signal Processing (CISP); Pub Date: 2013-12-01; DOI: 10.1109/CISP.2013.6743873

Yi Liu, Jing Hua, Xiangang Li, Xihong Wu

Citations: 0

Abstract

Chinese, unlike Western languages, has no standard definition of a word, so the choice of lexicon plays an important role in Chinese language modeling. This paper proposes a novel method for constructing the lexicon automatically. Rather than relying on statistical measures of text features, the method is driven directly by error feedback from the target task, which in this paper is phoneme-to-grapheme conversion. The process consists of two iterative phases: Phase One selects individual words from a large manually built lexicon, and Phase Two extracts compound words on the basis of the Phase One result. Experiments on phoneme-to-grapheme conversion show that the method achieves absolute reductions in character error rate of 1.09% for Phase One and 0.38% for Phase Two, compared with baseline lexicons of the same size generated by the conventional word-frequency method.
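The abstract reports its results as absolute reductions in character error rate (CER). As a reading aid, here is a minimal sketch of how CER and an absolute reduction are commonly computed, assuming CER is the character-level edit distance divided by the number of reference characters. The abstract does not give the authors' scoring procedure, so the function names `edit_distance`, `character_error_rate`, and `absolute_reduction` are hypothetical, and the sample strings are toy data, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Assumed definition: total edit distance / total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total_chars


def absolute_reduction(baseline_cer, new_cer):
    """Absolute (not relative) difference between two error rates."""
    return baseline_cer - new_cer


# Toy data, not from the paper: one substituted character out of six.
cer = character_error_rate(["今天天气很好"], ["今天天汽很好"])
```

An absolute reduction is a difference in percentage points between two error rates, so it should not be read as a relative improvement over the baseline.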


