Efficient incremental training using a novel NMT-SMT hybrid framework for translation of low-resource languages.

Impact factor: 3.0 · JCR Q2 (Computer Science, Artificial Intelligence)
Frontiers in Artificial Intelligence · Pub Date: 2024-09-25 · eCollection Date: 2024-01-01 · DOI: 10.3389/frai.2024.1381290
Kumar Bhuvaneswari, Murugesan Varalakshmi
Volume 7, article 1381290 · Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11461459/pdf/
Citations: 0

Abstract

Statistical machine translation (SMT) and neural machine translation (NMT) models are data-hungry and offer state-of-the-art results for languages with abundant data resources. Considerable further research is needed before these models perform equally well for low-resource languages. This paper proposes a novel approach that integrates the best features of NMT and SMT systems to improve translation performance for the low-resource English-Tamil language pair. A suboptimal NMT model, trained on a small parallel corpus, translates a monolingual corpus and selects only the best translations, which it then uses to retrain itself in the next iteration. The proposed method uses the SMT phrase-pair table to determine the best translations, based on the maximum match between the words of the phrase-pair dictionary and each individual translation. This repeated cycle of translation and retraining generates a large quasi-parallel corpus, thereby strengthening the NMT model. SMT-integrated incremental training shows a substantial improvement in translation performance over existing approaches to incremental training. The model is strengthened further by adopting a beam search decoding strategy that produces the k best candidate translations for each input sentence. Empirically, the proposed model achieves BLEU scores of 19.56 and 23.49 for English-to-Tamil and Tamil-to-English translation, respectively, outperforming the baseline NMT scores of 11.06 and 17.06. METEOR evaluation further corroborates these results, supporting the superiority of the proposed model.
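The selection step described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the normalised word-match score, the `threshold` cut-off, the toy phrase table, and every function name below (`match_score`, `select_best`, `build_quasi_parallel`, `translate_k_best`) are assumptions made for illustration. The placeholder target tokens (`tgt_cat`, etc.) are not real Tamil.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy phrase table: source word -> target words paired with it.
# A real table would come from an SMT toolkit; this shape is an assumption.
PhraseTable = Dict[str, Set[str]]


def match_score(source: str, candidate: str, table: PhraseTable) -> float:
    """Fraction of source words for which the candidate contains at least
    one target word listed in the phrase table (assumed scoring rule)."""
    src_words = source.lower().split()
    cand_words = set(candidate.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if table.get(w, set()) & cand_words)
    return hits / len(src_words)


def select_best(source: str, candidates: List[str],
                table: PhraseTable) -> Tuple[str, float]:
    """Pick the candidate (e.g. one of k beam-search outputs) with the
    highest phrase-table match; ties go to the earliest candidate."""
    scored = [(match_score(source, c, table), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def build_quasi_parallel(monolingual: List[str],
                         translate_k_best: Callable[[str], List[str]],
                         table: PhraseTable,
                         threshold: float = 0.5) -> List[Tuple[str, str]]:
    """One iteration: translate each monolingual sentence, keep only the
    best candidate, and only if its score reaches the assumed threshold.
    The returned pairs would be added to the training data for retraining."""
    pairs = []
    for src in monolingual:
        candidates = translate_k_best(src)
        if not candidates:
            continue
        best, score = select_best(src, candidates, table)
        if score >= threshold:
            pairs.append((src, best))
    return pairs


if __name__ == "__main__":
    table = {"cat": {"tgt_cat"}, "drinks": {"tgt_drinks"}, "milk": {"tgt_milk"}}
    # Stand-in for an NMT model with beam search returning k candidates.
    fake_nmt = {"cat drinks milk": ["tgt_cat tgt_x tgt_y",
                                    "tgt_cat tgt_drinks tgt_milk"]}
    print(build_quasi_parallel(list(fake_nmt), lambda s: fake_nmt[s], table))
```

In this sketch, `translate_k_best` stands in for the NMT model decoding with beam search; repeating `build_quasi_parallel` after each retraining round would grow the quasi-parallel corpus in the way the abstract describes.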

Source journal metrics: CiteScore 6.10 · self-citation rate 2.50% · articles published: 272 · review time: 13 weeks