Improved Graph-Based Bilingual Corpus Selection with Sentence Pair Ranking for Statistical Machine Translation

2011 IEEE 23rd International Conference on Tools with Artificial Intelligence. Pub Date: 2011-11-07. DOI: 10.1109/ICTAI.2011.73

Wen-Han Chao, Zhoujun Li


Citations: 5

Abstract


Improved Graph-Based Bilingual Corpus Selection with Sentence Pair Ranking for Statistical Machine Translation

In statistical machine translation, the number of sentence pairs in the bilingual corpus is very important to translation quality. However, once the corpus reaches a certain size, enlarging it further has less effect on translation quality while greatly increasing the time and space complexity of building translation systems, which hinders the development of statistical machine translation. In this paper, we propose several ranking approaches that measure the quantity of information in each sentence pair, and we apply them within a graph-based bilingual corpus selection framework. The resulting improved selection approach takes into account the differences in the initial quantities of information between sentence pairs. Our experiments on a Chinese-English translation task show that, when only 50% of the whole corpus is selected as the training set via the graph-based selection approach, the translation results are close to those obtained with the whole corpus, and that using the IDF-related ranking approach yields better results than the baselines.
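The abstract does not spell out how the IDF-related ranking is computed, so the following is only a rough sketch of one plausible reading, not the authors' method. It assumes that a sentence pair's "quantity of information" can be approximated by the mean inverse document frequency of its source and target tokens, and that selection keeps the top-ranked fraction of pairs. The function names `idf_table`, `rank_pairs`, and `select_top` are hypothetical, and the graph-based part of the framework is not modelled here.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Map each token to log(N / df), where df is the number of
    sentences (out of N) that contain the token at least once."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log(n / count) for tok, count in df.items()}


def rank_pairs(pairs):
    """Rank (source_tokens, target_tokens) pairs by mean token IDF,
    highest first. This scoring rule is an illustrative assumption."""
    src_idf = idf_table([src for src, _ in pairs])
    tgt_idf = idf_table([tgt for _, tgt in pairs])

    def score(pair):
        src, tgt = pair
        total = sum(src_idf[t] for t in src) + sum(tgt_idf[t] for t in tgt)
        # Guard against empty pairs to avoid division by zero.
        return total / max(len(src) + len(tgt), 1)

    return sorted(pairs, key=score, reverse=True)


def select_top(pairs, fraction=0.5):
    """Keep the top-ranked fraction of sentence pairs (at least one)."""
    ranked = rank_pairs(pairs)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Toy, made-up corpus purely for illustration.
    corpus = [
        (["the", "cat"], ["le", "chat"]),
        (["the", "dog"], ["le", "chien"]),
        (["the", "the"], ["le", "le"]),
        (["rare", "word"], ["mot", "rare"]),
    ]
    for src, tgt in select_top(corpus, fraction=0.5):
        print(" ".join(src), "|||", " ".join(tgt))
```

Under this scoring, pairs made up of tokens that are rare in the corpus rank above pairs made up of frequent tokens, which matches the intuition that frequent tokens contribute little new information to the training set.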


Source publication

2011 IEEE 23rd International Conference on Tools with Artificial Intelligence

Self-citation rate

0.00%

Number of publications