{"title":"Fine-Tuning Self-Supervised Multilingual Sequence-To-Sequence Models for Extremely Low-Resource NMT","authors":"Sarubi Thillainathan, Surangika Ranathunga, Sanath Jayasena","doi":"10.1109/MERCon52712.2021.9525720","DOIUrl":null,"url":null,"abstract":"Neural Machine Translation (NMT) tends to perform poorly in low-resource language settings due to the scarcity of parallel data. Instead of relying on inadequate parallel corpora, we can take advantage of monolingual data available in abundance. Training a denoising self-supervised multilingual sequence-to-sequence model by noising the available large scale monolingual corpora is one way to utilize monolingual data. For a pair of languages for which monolingual data is available in such a pre-trained multilingual denoising model, the model can be fine-tuned with a smaller amount of parallel data from this language pair. This paper presents fine-tuning self-supervised multilingual sequence-to-sequence pre-trained models for extremely low-resource domain-specific NMT settings. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English centric complete fine-tuning on multilingual sequence-to-sequence pre-trained models. We select Sinhala, Tamil and English languages to demonstrate fine-tuning on extremely low-resource settings in the domain of official government documents. Experiments show that our fine-tuned mBART model significantly outperforms state-of-the-art Transformer based NMT models in all pairs in all six bilingual directions, where we report a 4.41 BLEU score increase on Tamil→Sinhala and a 2.85 BLUE increase on Sinhala→ Tamil translation.","PeriodicalId":6855,"journal":{"name":"2021 Moratuwa Engineering Research Conference (MERCon)","volume":"32 1","pages":"432-437"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 Moratuwa Engineering Research Conference (MERCon)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MERCon52712.2021.9525720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9
Abstract
Neural Machine Translation (NMT) tends to perform poorly in low-resource language settings due to the scarcity of parallel data. Instead of relying on inadequate parallel corpora, we can take advantage of monolingual data, which is available in abundance. One way to utilize monolingual data is to train a denoising self-supervised multilingual sequence-to-sequence model by adding noise to the available large-scale monolingual corpora. For a language pair whose monolingual data is included in such a pre-trained multilingual denoising model, the model can then be fine-tuned with a smaller amount of parallel data from that pair. This paper presents fine-tuning of self-supervised multilingual sequence-to-sequence pre-trained models for extremely low-resource, domain-specific NMT settings. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of multilingual sequence-to-sequence pre-trained models. We select Sinhala, Tamil and English to demonstrate fine-tuning in extremely low-resource settings in the domain of official government documents. Experiments show that our fine-tuned mBART model significantly outperforms state-of-the-art Transformer-based NMT models across all six bilingual directions, where we report a 4.41 BLEU score increase on Tamil→Sinhala and a 2.85 BLEU increase on Sinhala→Tamil translation.
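To make the fine-tuning recipe concrete, the following is a minimal sketch of adapting a pre-trained multilingual denoising model to one non-English-centric direction (Sinhala→Tamil) with a small parallel corpus, using the Hugging Face Transformers API. The checkpoint name, language codes, placeholder sentences, and hyperparameters are illustrative assumptions and not the paper's exact configuration or data.

```python
# Sketch: fine-tune an mBART-style checkpoint on a tiny Sinhala->Tamil parallel set.
# Assumes the facebook/mbart-large-50 checkpoint, which covers si_LK and ta_IN.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50"  # assumed checkpoint, not the paper's exact model
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="si_LK", tgt_lang="ta_IN"
)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Placeholder parallel data standing in for the domain-specific corpus.
src_sentences = ["<Sinhala source sentence>"]
tgt_sentences = ["<Tamil reference sentence>"]

# Tokenize source and target together; labels are filled from text_target.
batch = tokenizer(
    src_sentences,
    text_target=tgt_sentences,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for epoch in range(3):  # a few passes over a tiny corpus, for illustration only
    outputs = model(**batch)      # cross-entropy loss over target tokens
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: force the decoder to begin with the Tamil language token.
model.eval()
generated = model.generate(
    **tokenizer(src_sentences, padding=True, return_tensors="pt"),
    forced_bos_token_id=tokenizer.lang_code_to_id["ta_IN"],
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

In practice one would batch over the full parallel corpus with a DataLoader, add learning-rate warm-up and early stopping on a validation set, and evaluate with BLEU; the loop above only illustrates the complete (all-parameter) fine-tuning idea the abstract describes.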