Multimodal retrieval-augmented generation framework for machine translation

IF 1.6 · CAS Tier 4 (Computer Science) · JCR Q3 (Engineering, Electrical & Electronic)
Shijian Li
{"title":"Multimodal retrieval-augmented generation framework for machine translation","authors":"Shijian Li","doi":"10.4218/etrij.2024-0196","DOIUrl":null,"url":null,"abstract":"<p>The development of multimodal machine translation (MMT) systems has attracted significant interest due to their potential to enhance translation accuracy with visual information. However, there are two limitations: (i) scarce large-scale corpus data in the form of (text, image, text) triplets and (ii) the semantic information learned by pre-training cannot transfer to multilingual translation tasks. To address these challenges, we propose a novel multimodal retrieval-augmented generation framework for machine translation, abbreviated as MRF-MT. Specifically, using the source text as a query, we retrieve relevant (image, text) pairs to guide image generation and feed the generated images into the image encoder of Multilingual Contrastive Language-Image Pre-training (M-CLIP) for learning visual information. Subsequently, we employ a projection network to transfer visual information learned by M-CLIP as a decoder prefix to Multilingual Bidirectional and Auto-Regressive Transformers (mBART) and train the mBART decoder using a two-stage pre-training pipeline. Initially, the mBART decoder is trained for image captioning with a visual–textual decoder prefix from M-CLIP's image encoder projection network. Subsequently, it undergoes training for caption translation, using prefixes from M-CLIP's text encoder. 
Extensive experiments show that MFR-MT achieves promising performance compared with baselines.</p>","PeriodicalId":11901,"journal":{"name":"ETRI Journal","volume":"47 4","pages":"707-720"},"PeriodicalIF":1.6000,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2024-0196","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETRI Journal","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.4218/etrij.2024-0196","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

The development of multimodal machine translation (MMT) systems has attracted significant interest due to their potential to enhance translation accuracy with visual information. However, two limitations remain: (i) large-scale corpus data in the form of (text, image, text) triplets is scarce, and (ii) the semantic information learned during pre-training does not transfer to multilingual translation tasks. To address these challenges, we propose a novel multimodal retrieval-augmented generation framework for machine translation, abbreviated as MRF-MT. Specifically, using the source text as a query, we retrieve relevant (image, text) pairs to guide image generation and feed the generated images into the image encoder of Multilingual Contrastive Language-Image Pre-training (M-CLIP) to learn visual information. Subsequently, we employ a projection network to transfer the visual information learned by M-CLIP as a decoder prefix to Multilingual Bidirectional and Auto-Regressive Transformers (mBART) and train the mBART decoder using a two-stage pre-training pipeline. First, the mBART decoder is trained for image captioning with a visual-textual decoder prefix produced by the projection network from M-CLIP's image encoder. Second, it is trained for caption translation, using prefixes derived from M-CLIP's text encoder. Extensive experiments show that MRF-MT achieves promising performance compared with baselines.
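A minimal sketch of the two components the abstract names: similarity-based retrieval over an (image, text) memory keyed by M-CLIP text embeddings, and a projection network that maps one M-CLIP image embedding to a fixed-length mBART decoder prefix. The paper does not publish code, so every name, shape, and layer choice below is an illustrative assumption (d_clip = 512 for the CLIP embedding, 1024 for the mBART-large hidden size, a 4-token prefix, random weights in place of trained ones).

```python
import numpy as np

rng = np.random.default_rng(0)
d_clip, d_model, prefix_len, N = 512, 1024, 4, 100

# Hypothetical retrieval memory: N (image, text) pairs indexed by
# L2-normalized text embeddings from M-CLIP's text encoder.
memory_keys = rng.normal(size=(N, d_clip))
memory_keys /= np.linalg.norm(memory_keys, axis=1, keepdims=True)

def retrieve(query_emb, k=5):
    """Top-k (image, text) pair indices by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = memory_keys @ q          # cosine similarity, shape (N,)
    return np.argsort(-scores)[:k]

# Hypothetical projection network: a two-layer MLP that turns one
# M-CLIP image embedding into `prefix_len` decoder-prefix vectors.
W1 = rng.normal(size=(d_clip, d_model)) * 0.02
W2 = rng.normal(size=(d_model, prefix_len * d_model)) * 0.02

def project_to_prefix(img_emb):
    h = np.tanh(img_emb @ W1)                     # (d_model,)
    return (h @ W2).reshape(prefix_len, d_model)  # (prefix_len, d_model)

query = rng.normal(size=d_clip)      # stands in for the source-text embedding
top = retrieve(query, k=5)
prefix = project_to_prefix(rng.normal(size=d_clip))
print(top.shape, prefix.shape)       # (5,) (4, 1024)
```

In the framework as described, the prefix rows would be prepended to the decoder's input embeddings before generation; in stage one the prefix comes from the image encoder, in stage two from the text encoder, so the same decoder learns to condition on either modality.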


Source journal: ETRI Journal (Engineering & Technology, Telecommunications)
CiteScore: 4.00
Self-citation rate: 7.10%
Articles per year: 98
Review time: 6.9 months
About the journal: ETRI Journal is an international, peer-reviewed multidisciplinary journal published bimonthly in English. The main focus of the journal is to provide an open forum to exchange innovative ideas and technology in the fields of information, telecommunications, and electronics. Key topics of interest include high-performance computing, big data analytics, cloud computing, multimedia technology, communication networks and services, wireless communications and mobile computing, material and component technology, as well as security. With an international editorial committee and experts from around the world as reviewers, ETRI Journal publishes high-quality research papers on the latest and best developments from the global community.