Improved Graph-Based Bilingual Corpus Selection with Sentence Pair Ranking for Statistical Machine Translation

Wen-Han Chao, Zhoujun Li
Published in: 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence
Publication date: 2011-11-07
DOI: 10.1109/ICTAI.2011.73 (https://doi.org/10.1109/ICTAI.2011.73)
Citations: 5

Abstract
In statistical machine translation, the number of sentence pairs in the bilingual corpus strongly affects translation quality. However, once the corpus reaches a certain size, enlarging it further yields diminishing gains in translation quality while greatly increasing the time and space needed to build translation systems, which hinders the development of statistical machine translation. In this paper, we propose several ranking approaches that measure the quantity of information in each sentence pair, and we apply them within a graph-based bilingual corpus selection framework. The resulting improved selection approach takes into account differences in the initial quantities of information between sentence pairs. Our experiments on a Chinese-English translation task show that training on only 50% of the whole corpus, selected via the graph-based approach, yields translation results close to those obtained with the whole corpus, and that using the IDF-related ranking approach yields better results than the baselines.
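The abstract does not spell out the ranking formula, so the following is only a minimal sketch of what an IDF-based sentence pair ranking could look like, not the authors' method. It assumes that a pair's score is the sum of the IDF weights of its distinct tokens on both the source and target sides, and that the top fraction of pairs is kept. The function names (`idf_table`, `rank_pairs`, `select_top`), the scoring rule, and the toy data are all hypothetical, and the graph-based selection step itself is not modeled here.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Map each token to log(N / df), where df is the number of
    sentences (out of N) that contain the token."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log(n / count) for tok, count in df.items()}


def rank_pairs(pairs):
    """Return pair indices sorted by descending score.

    Assumed scoring rule (not from the paper): sum of IDF weights of
    the distinct tokens on the source side plus the target side.
    """
    src_idf = idf_table([src for src, _ in pairs])
    tgt_idf = idf_table([tgt for _, tgt in pairs])

    def score(pair):
        src, tgt = pair
        return (sum(src_idf[t] for t in set(src))
                + sum(tgt_idf[t] for t in set(tgt)))

    return sorted(range(len(pairs)), key=lambda i: score(pairs[i]), reverse=True)


def select_top(pairs, fraction=0.5):
    """Keep the highest-ranked fraction of the sentence pairs."""
    keep = max(1, int(len(pairs) * fraction))
    return [pairs[i] for i in rank_pairs(pairs)[:keep]]


# Toy tokenized Chinese-English corpus (made up for illustration).
pairs = [
    (["我", "喜欢", "猫"], ["i", "like", "cats"]),
    (["我", "喜欢", "狗"], ["i", "like", "dogs"]),
    (["图", "排序", "语料库"], ["graph", "ranking", "corpus"]),
    (["我", "喜欢", "猫"], ["i", "like", "cats"]),
]
selected = select_top(pairs, fraction=0.5)
```

With this scoring rule, pairs made of rare tokens rank first and exact repeats rank last, which matches the intuition that redundant sentence pairs carry little new information. Because the score is an unnormalized sum, it also favors longer sentences; dividing by sentence length would be one possible variant.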