Fine-Tuning Self-Supervised Multilingual Sequence-To-Sequence Models for Extremely Low-Resource NMT
Sarubi Thillainathan, Surangika Ranathunga, Sanath Jayasena
2021 Moratuwa Engineering Research Conference (MERCon), pp. 432-437. Published 2021-07-27.
DOI: https://doi.org/10.1109/MERCon52712.2021.9525720
Neural Machine Translation (NMT) tends to perform poorly in low-resource language settings due to the scarcity of parallel data. Instead of relying on inadequate parallel corpora, we can take advantage of the monolingual data that is available in abundance. One way to use monolingual data is to train a denoising self-supervised multilingual sequence-to-sequence model by noising the available large-scale monolingual corpora. For a language pair whose monolingual data is included in such a pre-trained multilingual denoising model, the model can be fine-tuned with a smaller amount of parallel data from that pair. This paper presents fine-tuning of self-supervised multilingual sequence-to-sequence pre-trained models for extremely low-resource, domain-specific NMT settings. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning on multilingual sequence-to-sequence pre-trained models. We select Sinhala, Tamil and English to demonstrate fine-tuning in extremely low-resource settings in the domain of official government documents. Experiments show that our fine-tuned mBART model significantly outperforms state-of-the-art Transformer-based NMT models in all six bilingual directions, with a 4.41 BLEU increase on Tamil→Sinhala and a 2.85 BLEU increase on Sinhala→Tamil translation.
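The noising step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code or the exact mBART procedure: it assumes a simplified span-masking scheme in which contiguous runs of tokens are hidden and each run is replaced by a single mask token, so that a sequence-to-sequence model could be trained to reconstruct the original sentence. The function name `add_noise`, the `<mask>` string, the fixed span length, and the 0.35 masking ratio are all illustrative assumptions.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for illustration


def add_noise(tokens, mask_ratio=0.35, span_len=3, seed=0):
    """Hide roughly mask_ratio of the tokens in contiguous spans.

    Each hidden run is collapsed into one MASK token, so the noised
    sequence is usually shorter than the original. The fixed span_len
    is a simplifying assumption; real systems may sample span lengths.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = round(n * mask_ratio)  # how many original tokens to hide
    hidden = [False] * n
    hidden_count = 0
    attempts = 0
    # Pick random span starts until the budget is used up.
    while hidden_count < budget and attempts < 10 * n:
        start = rng.randrange(n)
        for i in range(start, min(start + span_len, n)):
            if hidden_count < budget and not hidden[i]:
                hidden[i] = True
                hidden_count += 1
        attempts += 1
    # Collapse every run of hidden tokens into a single MASK.
    noised = []
    for tok, is_hidden in zip(tokens, hidden):
        if not is_hidden:
            noised.append(tok)
        elif not noised or noised[-1] != MASK:
            noised.append(MASK)
    return noised


if __name__ == "__main__":
    sentence = "the ministry issued a circular to all government departments".split()
    # Denoising pre-training pairs the noised input with the clean target.
    print("input :", " ".join(add_noise(sentence, seed=1)))
    print("target:", " ".join(sentence))
```

In this setup the monolingual sentence serves as its own training target, which is why no parallel data is needed at the pre-training stage; parallel data enters only during fine-tuning.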