Findings and Considerations in Active Learning Based Framework for Resource-Poor SMT

2013 International Conference on Asian Language Processing. Pub Date: 2013-08-17. DOI: 10.1109/IALP.2013.28

Jinhua Du, Meng Zhang

Citations: 1

Abstract

Active learning (AL) for resource-poor SMT is an efficient and feasible way to acquire high-quality parallel data and thereby improve translation quality. This paper first studies two mainstream sentence selection algorithms, Geom-phrase and Geom n-gram, and then proposes a selection method based on sentence perplexity. Comparison experiments on Chinese-English NIST data yield several important findings, such as the impact of sentence length on AL performance. Accordingly, a preprocessing strategy is presented that filters the original monolingual corpus to obtain more informative sentences. Experimental results on the preprocessed data show that the performance of all three selection algorithms improves significantly compared with the results on the original data.
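
The abstract does not spell out how the perplexity-based selection or the preprocessing filter is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an add-one-smoothed unigram language model trained on the source side of the existing bitext, assumes that higher perplexity marks a more informative sentence, and uses a simple length filter as a stand-in for the preprocessing step. The function names and the `min_len`/`max_len` bounds are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count tokens in the tokenized source side of the existing bitext."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())

def perplexity(sentence, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model (assumed LM)."""
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    log_prob = sum(math.log((counts[tok] + 1) / (total + vocab)) for tok in sentence)
    return math.exp(-log_prob / len(sentence))

def select_batch(pool, counts, total, k, min_len=1, max_len=80):
    """Return the k highest-perplexity pool sentences within the length bounds.

    The length bounds are a hypothetical placeholder for the paper's
    preprocessing filter, whose exact criteria the abstract does not give.
    """
    candidates = [s for s in pool if min_len <= len(s) <= max_len]
    return sorted(candidates, key=lambda s: perplexity(s, counts, total), reverse=True)[:k]

# Toy usage: pick one sentence from a monolingual pool for human translation.
bitext_source = [["the", "cat", "sat"], ["the", "dog", "sat"]]
pool = [["the", "cat"], ["a", "zebra", "ran"], []]
counts, total = train_unigram(bitext_source)
batch = select_batch(pool, counts, total, k=1)
```

In an AL loop, the selected batch would be sent for translation, added to the bitext, and the language model retrained before the next round. A real system would likely use a higher-order n-gram model rather than the unigram model assumed here.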
