Offline Corpus Augmentation for English-Amharic Machine Translation

Yohannes Biadgligne, K. Smaïli
2022 5th International Conference on Information and Computer Technologies (ICICT)
Publication date: 2022-03-01
DOI: 10.1109/ICICT55905.2022.00030 (https://doi.org/10.1109/ICICT55905.2022.00030)
Citations: 3

Abstract

This paper investigates the effect of corpus augmentation on the quality of English-Amharic Machine Translation (MT), with the goal of improving both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for this language pair. For this investigation, we created SMT models with tri-gram and four-gram language models, as well as NMT models based on Gated Recurrent Units (GRU) and on Recurrent Neural Network (RNN) models with an attention mechanism. To observe how corpus augmentation affects translation quality, we trained each model separately on our original corpus and on the augmented one. The original and augmented corpora contain 225,304 and 450,608 English-Amharic parallel sentences, respectively. The augmentation was performed offline, at the token level, before any other MT processing started. Among the available token-level augmentation techniques, we chose and implemented random insertion, replacement, deletion, and swapping. After training, we collected and analyzed Bilingual Evaluation Understudy (BLEU) scores. The models trained on the augmented corpus outperform their counterparts trained on the original corpus in terms of BLEU. We therefore conclude that corpus augmentation improved the performance of both the SMT and NMT systems.
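
The four token-level operations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how inserted or replacement tokens are chosen, which side of a sentence pair is perturbed, or how many operations are applied per sentence. The sketch assumes tokens are whitespace-separated, new tokens are drawn from a caller-supplied vocabulary, only the source side is changed, and each pair receives exactly one augmented copy (which doubles the corpus, consistent with the sizes reported above). All function names are hypothetical.

```python
import random


def random_insertion(tokens, vocab, rng):
    """Insert one token drawn from vocab at a random position."""
    out = list(tokens)
    out.insert(rng.randint(0, len(out)), rng.choice(vocab))
    return out


def random_replacement(tokens, vocab, rng):
    """Replace one randomly chosen token with a token drawn from vocab."""
    if not tokens:
        return list(tokens)
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def random_deletion(tokens, rng):
    """Delete one randomly chosen token, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    out = list(tokens)
    del out[rng.randrange(len(out))]
    return out


def random_swap(tokens, rng):
    """Swap the positions of two distinct randomly chosen tokens."""
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def augment_corpus(pairs, src_vocab, seed=0):
    """Return the original pairs followed by one augmented copy of each.

    Assumption: only the source sentence is perturbed and the target
    sentence is reused unchanged. The abstract does not specify this.
    """
    rng = random.Random(seed)
    ops = [
        lambda t: random_insertion(t, src_vocab, rng),
        lambda t: random_replacement(t, src_vocab, rng),
        lambda t: random_deletion(t, rng),
        lambda t: random_swap(t, rng),
    ]
    augmented = []
    for src, tgt in pairs:
        op = rng.choice(ops)  # one randomly chosen operation per sentence
        augmented.append((" ".join(op(src.split())), tgt))
    return list(pairs) + augmented


if __name__ == "__main__":
    # Placeholder targets stand in for Amharic sentences.
    corpus = [("the cat sat on the mat", "<target 1>"),
              ("a dog barked", "<target 2>")]
    for src, tgt in augment_corpus(corpus, ["house", "tree", "river"]):
        print(src, "|||", tgt)
```

Because the augmentation runs once, before training, its output can be written to disk and reused unchanged for both the SMT and the NMT pipelines, which is what makes the approach "offline".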