Extracting parallel phrases from comparable corpora

Jiexin Zhang, Hailong Cao, T. Zhao
Published in: 2014 International Conference on Asian Language Processing (IALP), 2014-10-01
DOI: 10.1109/IALP.2014.6973501 (https://doi.org/10.1109/IALP.2014.6973501)
Citations: 1

Abstract

State-of-the-art statistical machine translation (SMT) models are trained on parallel corpora. However, traditional SMT loses its power for language pairs with few bilingual resources. This paper proposes a novel method that treats phrase extraction as a classification task. We first automatically generate the training and testing phrase pairs for the classifier. Then, we train an SVM classifier that determines whether a phrase pair is parallel or non-parallel. The proposed approach is evaluated on a Chinese-English translation task. Experimental results show that the precision of the classifier on the test sets is above 70% and the accuracy is above 98%. The quality of the extracted data is also evaluated by measuring its impact on the performance of a state-of-the-art SMT system built with a small parallel corpus. The results improve over the baseline system.
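The abstract reports precision and accuracy as separate figures, and the two can diverge sharply when non-parallel candidate pairs far outnumber parallel ones. The sketch below illustrates this with purely hypothetical confusion-matrix counts; the numbers, the `Confusion` class, and the function names are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary parallel / non-parallel classifier."""
    tp: int  # parallel pairs correctly labelled parallel
    fp: int  # non-parallel pairs wrongly labelled parallel
    fn: int  # parallel pairs wrongly labelled non-parallel
    tn: int  # non-parallel pairs correctly labelled non-parallel


def precision(c: Confusion) -> float:
    """Fraction of pairs labelled parallel that really are parallel."""
    return c.tp / (c.tp + c.fp)


def accuracy(c: Confusion) -> float:
    """Fraction of all pairs that received the correct label."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


# Hypothetical test set: 120 parallel pairs among 10,000 candidates.
counts = Confusion(tp=70, fp=30, fn=50, tn=9850)

print(f"precision = {precision(counts):.3f}")
print(f"accuracy  = {accuracy(counts):.3f}")
```

With these made-up counts, the many correctly rejected non-parallel pairs dominate the accuracy, while precision depends only on the pairs the classifier accepts. This is one plausible reading of why the two reported figures differ; the paper's actual class balance is not stated in the abstract.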