Extracting parallel phrases from comparable corpora

2014 International Conference on Asian Language Processing (IALP), Pub Date: 2014-10-01, DOI: 10.1109/IALP.2014.6973501

Jiexin Zhang, Hailong Cao, T. Zhao

Citations: 1

Abstract

State-of-the-art statistical machine translation (SMT) models are trained on parallel corpora, so traditional SMT loses its power on language pairs with few bilingual resources. This paper proposes a novel method that treats phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. We then train an SVM classifier that determines whether a phrase pair is parallel or non-parallel. The proposed approach is evaluated on a Chinese-English translation task. Experimental results show that the classifier's precision on the test sets is above 70% and its accuracy is above 98%. The quality of the extracted data is also evaluated by measuring its impact on the performance of a state-of-the-art SMT system built with a small parallel corpus. The resulting system shows better results than the baseline system.
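Precision above 70% alongside accuracy above 98% can look inconsistent at first glance. A minimal sketch of how the two metrics are computed shows one way such a gap can arise. The confusion-matrix counts below are hypothetical and are not taken from the paper; they assume that non-parallel candidate pairs vastly outnumber parallel ones, which the abstract does not state.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of pairs labelled parallel that are truly parallel."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all pairs (parallel and non-parallel) labelled correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for illustration only (not results from the paper):
# 100 truly parallel pairs among 10,000 candidate phrase pairs.
tp, fn = 70, 30      # parallel pairs: correctly found vs. missed
fp, tn = 30, 9870    # non-parallel pairs: wrongly accepted vs. correctly rejected

print(f"precision = {precision(tp, fp):.3f}")         # precision = 0.700
print(f"accuracy  = {accuracy(tp, fp, tn, fn):.3f}")  # accuracy  = 0.994
```

With counts like these, the large number of correctly rejected non-parallel pairs dominates accuracy, while precision depends only on the pairs the classifier accepts. Under that assumption, precision is the more informative of the two figures for judging the quality of the extracted phrases.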

