Bilingual Segmenter for Statistical Machine Translation

Chung-Chi Huang, Wei-Teh Chen, Jason J. S. Chang
DOI: 10.1109/ISUC.2008.10
Published in: 2008 Second International Symposium on Universal Communication, December 15, 2008
Citations: 1

Abstract

We propose a bilingually motivated segmentation framework for Chinese, a language with no explicit delimiters marking word boundaries. The framework produces Chinese tokens aligned with the words of word-based languages via a bilingual segmenting algorithm run over bitexts, and then derives a probabilistic tokenizing model from the Chinese sentences annotated in that step. In the bilingual segmenting algorithm, we first recast the search for a segmentation as a sequential tagging problem, which admits a polynomial-time dynamic programming solution, and we incorporate a control that balances monolingual and bilingual information when tailoring Chinese sentences. Experiments show that our framework, applied as a pre-tokenization component, significantly outperforms existing segmenters in translation quality, suggesting that our methodology yields better segmentation for bilingual NLP applications involving isolating languages such as Chinese.
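The idea of recasting segmentation as sequential tagging with a polynomial-time dynamic program can be illustrated with a minimal sketch. This is not the paper's actual model: the tag set (B = begins a word, I = continues a word), the scoring functions, and the balancing weight `lam` below are hypothetical placeholders standing in for the paper's mono-/bilingual control.

```python
# Illustrative sketch: Chinese segmentation as B/I character tagging,
# solved by a Viterbi-style dynamic program. score_mono(i, tag) and
# score_bi(i, tag) are assumed log-score callbacks supplying monolingual
# and bilingual evidence; `lam` trades them off.

def segment(chars, score_mono, score_bi, lam=0.5):
    """Return a word list for `chars` via DP over B/I tags."""
    tags = ("B", "I")
    n = len(chars)

    def emit(i, t):
        # Balance monolingual and bilingual evidence for tagging char i.
        return lam * score_mono(i, t) + (1 - lam) * score_bi(i, t)

    # best[i][t]: best score of any tagging of chars[:i+1] ending with tag t.
    best = [{t: float("-inf") for t in tags} for _ in range(n)]
    back = [{t: None for t in tags} for _ in range(n)]
    best[0]["B"] = emit(0, "B")  # a sentence must begin a word

    for i in range(1, n):
        # All B/I transitions are allowed in this toy; a real model would
        # also score the transition itself.
        prev, score = max(((p, best[i - 1][p]) for p in tags),
                          key=lambda x: x[1])
        for t in tags:
            best[i][t] = score + emit(i, t)
            back[i][t] = prev

    # Backtrace the best tag sequence.
    t = max(tags, key=lambda x: best[n - 1][x])
    seq = [t]
    for i in range(n - 1, 0, -1):
        t = back[i][t]
        seq.append(t)
    seq.reverse()

    # Convert B/I tags back into words.
    words, cur = [], chars[0]
    for ch, t in zip(chars[1:], seq[1:]):
        if t == "B":
            words.append(cur)
            cur = ch
        else:
            cur += ch
    words.append(cur)
    return words


def toy_score(i, t):
    # Hypothetical scorer: rewards word starts at positions 0 and 2,
    # so "机器翻译" segments into two bigram words.
    return 1.0 if (t == "B") == (i in (0, 2)) else 0.0


print(segment(list("机器翻译"), toy_score, toy_score))  # → ['机器', '翻译']
```

Because each character is tagged once and each DP cell looks back only at the previous position, the run time is polynomial (linear in sentence length for a fixed tag set), which is the property the abstract highlights.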