Domain-Specific Text Generation for Machine Translation

Conference of the Association for Machine Translation in the Americas | Pub Date: 2022-08-11 | DOI: 10.48550/arXiv.2208.05909

Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way


Citations: 5

Abstract

Preservation of domain knowledge from source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects for which there is hardly any parallel in-domain data. In such scenarios, where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we used the state-of-the-art MT architecture, Transformer. We employed mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, our proposed methods achieved improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.
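The data flow described in the abstract (LM-based generation of in-domain text, back-translation, then mixed fine-tuning) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `synthesize_pairs` and `mix_for_fine_tuning`, the toy stand-ins for the language model and the back-translation system, and the choice to oversample in-domain pairs up to the size of the generic data are all assumptions made here for illustration.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def synthesize_pairs(
    seeds: List[str],
    generate: Callable[[str, int], List[str]],
    back_translate: Callable[[str], str],
    per_seed: int = 3,
) -> List[Pair]:
    """Build synthetic in-domain pairs (hypothetical helper).

    Each seed is an in-domain target-language sentence. `generate` stands in
    for a pretrained LM that continues the seed; `back_translate` stands in
    for a target-to-source MT model.
    """
    pairs: List[Pair] = []
    for seed in seeds:
        for target in generate(seed, per_seed):
            source = back_translate(target)  # synthetic source side
            pairs.append((source, target))
    return pairs


def mix_for_fine_tuning(
    in_domain: List[Pair], generic: List[Pair], seed: int = 0
) -> List[Pair]:
    """Oversample in-domain pairs to the generic size, then shuffle together.

    The 1:1 ratio is an assumption of this sketch, not a value from the paper.
    """
    in_domain = list(in_domain)
    generic = list(generic)
    if not in_domain:
        return generic
    target_size = max(len(generic), len(in_domain))
    repeats = -(-target_size // len(in_domain))  # ceiling division
    oversampled = (in_domain * repeats)[:target_size]
    mixed = oversampled + generic
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(seed_text: str, n: int) -> List[str]:
    return [f"{seed_text} (variant {i})" for i in range(n)]


def toy_back_translate(target_text: str) -> str:
    return "<src> " + target_text


if __name__ == "__main__":
    synthetic = synthesize_pairs(
        ["dosage instructions", "adverse reactions"],
        toy_generate,
        toy_back_translate,
        per_seed=2,
    )
    generic_data = [(f"g{i}", f"G{i}") for i in range(10)]
    training_set = mix_for_fine_tuning(synthetic, generic_data)
    print(len(synthetic), len(training_set))
```

In a real setup, the two callables would wrap an actual pretrained LM and an MT model, and the mixed set would be used to continue training the baseline MT model.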
