Pseudo-kNN-MT: Enhancing domain adaptability of neural machine translation via target language data

Impact factor: 7.6. Zone 1 (Computer Science). JCR Q1 (Computer Science, Artificial Intelligence).

Knowledge-Based Systems. Pub Date: 2025-09-23. DOI: 10.1016/j.knosys.2025.114513

Abudurexiti Reheman, Yingfeng Luo, Junhao Ruan, Hongyu Liu, Tong Xiao, Jingbo Zhu

Knowledge-Based Systems, vol. 330, Article 114513. Full text: https://www.sciencedirect.com/science/article/pii/S0950705125015527


Abstract

Although Neural Machine Translation (NMT) has recently achieved remarkable performance improvements, it still faces challenges in domain adaptation. Previous research has focused on mitigating this issue by integrating translation knowledge from bilingual domain data. However, the limited availability of bilingual translation resources has constrained these methods in real-world applications. To address this limitation, solutions based on monolingual data, such as back-translation, have been proposed. Nevertheless, these methods often incur additional training costs because reverse models must be trained to generate pseudo data. In light of this, we propose Pseudo-kNN-MT, which does not require additional training. This method creates pseudo-bilingual data pairs by retrieving semantically similar sentences from target language data and subsequently builds the kNN datastore. To reduce the noise introduced by the pseudo-data, we incorporate cross-lingual retrieval distances into the kNN probability construction process. Experiments in both high-resource and low-resource machine translation scenarios across multiple domains demonstrate that our method significantly improves the domain adaptation capabilities of NMT in both settings, yielding average improvements of 6.08 and 7.70 SacreBLEU points and 0.66 and 1.62 COMET points on the multi-domain dataset, respectively.
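The abstract names three steps: retrieving target-language sentences to form pseudo pairs, building a kNN datastore from them, and folding the cross-lingual retrieval distance into the kNN probability. The sketch below illustrates how such a pipeline could be wired together; it is not the authors' implementation. The abstract does not say how the retrieval distance enters the probability, so the additive penalty, the `pair_weight` and `temperature` parameters, the choice of squared L2 distance, and all function names are assumptions made for illustration. Only the final interpolation of a kNN distribution with the NMT distribution follows the standard kNN-MT formulation.

```python
import math
from collections import defaultdict


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_pseudo_pairs(source_vecs, target_vecs):
    """Pair each source sentence with its nearest target-language sentence.

    Both arguments map a sentence id to an embedding that is ASSUMED to come
    from a shared cross-lingual encoder (hypothetical; not named in the abstract).
    Returns (source_id, target_id, retrieval_distance) triples.
    """
    pairs = []
    for src_id, src_vec in source_vecs.items():
        tgt_id, dist = min(
            ((t_id, squared_l2(src_vec, t_vec)) for t_id, t_vec in target_vecs.items()),
            key=lambda item: item[1],
        )
        pairs.append((src_id, tgt_id, dist))
    return pairs


def knn_distribution(neighbors, temperature=10.0, pair_weight=1.0):
    """Turn retrieved datastore entries into a probability over tokens.

    neighbors: non-empty list of (token, key_distance, pair_distance), where
    key_distance is the distance between the decoder query and the datastore
    key, and pair_distance is the retrieval distance of the pseudo pair that
    produced the entry.
    """
    scores = defaultdict(float)
    for token, key_dist, pair_dist in neighbors:
        # ASSUMPTION: noisier pseudo pairs are down-weighted by adding their
        # retrieval distance to the key distance before the softmax.
        penalised = key_dist + pair_weight * pair_dist
        scores[token] += math.exp(-penalised / temperature)
    total = sum(scores.values())
    return {token: score / total for token, score in scores.items()}


def interpolate(p_nmt, p_knn, lam=0.5):
    """Standard kNN-MT mixing: lam * p_kNN + (1 - lam) * p_NMT."""
    vocab = set(p_nmt) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_nmt.get(t, 0.0) for t in vocab}


if __name__ == "__main__":
    # Toy vectors stand in for real sentence embeddings.
    print(build_pseudo_pairs({"s1": [0.0, 1.0]}, {"t1": [0.0, 0.9], "t2": [1.0, 0.0]}))
    p_knn = knn_distribution([("cat", 1.0, 0.0), ("dog", 1.0, 5.0)], temperature=1.0)
    print(interpolate({"cat": 0.5, "dog": 0.5}, p_knn, lam=0.5))
```

In this toy setup the two retrieved entries are equally close to the decoder query, so the only thing separating them is the retrieval distance of the pseudo pair each came from. The entry from the closer pair receives more probability mass, which is the noise-reduction effect the abstract describes.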


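Source journal

Knowledge-Based Systems (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months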

Journal description: Knowledge-Based Systems is an international, interdisciplinary journal in artificial intelligence that publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.