Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation

Impact Factor 9.3 · CAS Quartile 2 (Computer Science)
Jianhui Pang, Derek Fai Wong, Dayiheng Liu, Jun Xie, Baosong Yang, Yu Wan, Lidia Sam Chao
{"title":"Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation","authors":"Jianhui Pang, Derek Fai Wong, Dayiheng Liu, Jun Xie, Baosong Yang, Yu Wan, Lidia Sam Chao","doi":"10.1162/coli_a_00496","DOIUrl":null,"url":null,"abstract":"The utilization of monolingual data has been shown to be a promising strategy for addressing low-resource machine translation problems. Previous studies have demonstrated the effectiveness of techniques such as Back-Translation and self-supervised objectives, including Masked Language Modeling, Causal Language Modeling, and Denoise Autoencoding, in improving the performance of machine translation models. However, the manner in which these methods contribute to the success of machine translation tasks and how they can be effectively combined remains an under-researched area. In this study, we carry out a systematic investigation of the effects of these techniques on linguistic properties through the use of probing tasks, including source language comprehension, bilingual word alignment, and translation fluency. We further evaluate the impact of Pre-Training, Back-Translation, and Multi-Task Learning on bitexts of varying sizes. Our findings inform the design of more effective pipelines for leveraging monolingual data in extremely low-resource and low-resource machine translation tasks. Experiment results show consistent performance gains in seven translation directions, which provide further support for our conclusions and understanding of the role of monolingual data in machine translation.","PeriodicalId":49089,"journal":{"name":"Computational Linguistics","volume":"64 1","pages":""},"PeriodicalIF":9.3000,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00496","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

The utilization of monolingual data has been shown to be a promising strategy for addressing low-resource machine translation problems. Previous studies have demonstrated the effectiveness of techniques such as Back-Translation and self-supervised objectives, including Masked Language Modeling, Causal Language Modeling, and Denoising Autoencoding, in improving the performance of machine translation models. However, the manner in which these methods contribute to the success of machine translation tasks and how they can be effectively combined remains an under-researched area. In this study, we carry out a systematic investigation of the effects of these techniques on linguistic properties through the use of probing tasks, including source language comprehension, bilingual word alignment, and translation fluency. We further evaluate the impact of Pre-Training, Back-Translation, and Multi-Task Learning on bitexts of varying sizes. Our findings inform the design of more effective pipelines for leveraging monolingual data in extremely low-resource and low-resource machine translation tasks. Experimental results show consistent performance gains in seven translation directions, which provide further support for our conclusions and understanding of the role of monolingual data in machine translation.
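As a rough illustration of two of the techniques named in the abstract, the sketch below shows Back-Translation (turning target-side monolingual text into synthetic bitext with a reverse model) and the input corruption used for Masked Language Modeling. This is a minimal sketch, not the authors' implementation: the `translate` callable and the token-level masking scheme are hypothetical stand-ins.

```python
import random

def back_translate(target_sentences, translate):
    """Back-Translation sketch: a reverse (target-to-source) model,
    passed in as `translate`, converts target-side monolingual
    sentences into synthetic source sentences, yielding pseudo-bitext."""
    return [(translate(tgt), tgt) for tgt in target_sentences]

def mask_tokens(tokens, mask_token="<mask>", prob=0.15, seed=None):
    """Masked Language Modeling input corruption: replace a random
    fraction of tokens with a mask symbol; training then asks the
    model to reconstruct the original tokens."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < prob else tok for tok in tokens]

if __name__ == "__main__":
    # Toy demo: a string-reversing lambda stands in for a real reverse
    # NMT model (purely an assumption for illustration).
    mono = ["das ist ein Test", "guten Morgen"]
    print(back_translate(mono, translate=lambda s: s[::-1]))
    print(mask_tokens("the cat sat on the mat".split(), seed=0))
```

Under the same framing, the Denoising Autoencoding and Causal Language Modeling objectives mentioned in the abstract would swap the corruption step for noising (e.g., token deletion or shuffling) or for plain left-to-right next-token prediction, respectively.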
Source Journal

Computational Linguistics (Computer Science – Artificial Intelligence)
Self-citation rate: 0.00%
Articles published per year: 45

About the journal: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.