Title: Multimodal retrieval-augmented generation framework for machine translation
Author: Shijian Li
Journal: ETRI Journal, vol. 47, no. 4, pp. 707-720
Publication date: 2025-05-19
Publication type: Journal Article
DOI: 10.4218/etrij.2024-0196
URL: https://onlinelibrary.wiley.com/doi/10.4218/etrij.2024-0196
Journal metrics: impact factor 1.6; JCR Q3 (Engineering, Electrical & Electronic)
Citation count: 0
Abstract
Multimodal machine translation (MMT) systems have attracted significant interest because visual information can improve translation accuracy. However, two limitations remain: (i) large-scale corpora of (text, image, text) triplets are scarce, and (ii) the semantic information learned during pre-training does not transfer to multilingual translation tasks. To address these challenges, we propose a novel multimodal retrieval-augmented generation framework for machine translation, abbreviated as MRF-MT. Specifically, using the source text as a query, we retrieve relevant (image, text) pairs to guide image generation and feed the generated images into the image encoder of Multilingual Contrastive Language-Image Pre-training (M-CLIP) to learn visual information. We then employ a projection network to pass the visual information learned by M-CLIP to Multilingual Bidirectional and Auto-Regressive Transformers (mBART) as a decoder prefix, and train the mBART decoder using a two-stage pre-training pipeline. In the first stage, the mBART decoder is trained for image captioning with a visual–textual decoder prefix from the projection network of M-CLIP's image encoder. In the second stage, it is trained for caption translation, using prefixes from M-CLIP's text encoder. Extensive experiments show that MRF-MT achieves promising performance compared with baselines.
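The data flow described in the abstract (retrieve pairs with the source text as the query, encode an image, project the visual embedding into a decoder prefix) can be illustrated with toy vectors. This is only a sketch under stated assumptions, not the authors' implementation: the embeddings, dimensions, random weights, the single linear layer, and the names `retrieve` and `project_to_prefix` are all hypothetical, and neither M-CLIP, the image generator, nor mBART is invoked.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus, k=2):
    """Return the k (image_id, caption) pairs most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [(item["image_id"], item["caption"]) for item in ranked[:k]]


def project_to_prefix(visual_vec, weights, prefix_len, d_model):
    """Map one visual embedding to prefix_len vectors of size d_model (one linear layer)."""
    flat = [sum(w * v for w, v in zip(row, visual_vec)) for row in weights]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(prefix_len)]


# Toy retrieval corpus: the embeddings are made up for illustration.
corpus = [
    {"image_id": "img_dog", "caption": "a dog runs on the beach", "vec": [0.9, 0.1, 0.0]},
    {"image_id": "img_cat", "caption": "a cat sleeps on a sofa", "vec": [0.1, 0.9, 0.0]},
    {"image_id": "img_car", "caption": "a red car on a street", "vec": [0.0, 0.1, 0.9]},
]

# Assumed embedding of the source sentence, used as the retrieval query.
query_vec = [0.8, 0.2, 0.0]
pairs = retrieve(query_vec, corpus, k=2)

# Stand-in for the image-encoder output of the generated image (assumption:
# the real pipeline would generate an image guided by `pairs` and encode it).
visual_vec = [0.7, 0.3, 0.0]

# Hypothetical sizes: 3-dim visual embedding, 4-dim decoder states, 2 prefix slots.
d_clip, d_model, prefix_len = 3, 4, 2
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(d_clip)] for _ in range(prefix_len * d_model)]

prefix = project_to_prefix(visual_vec, weights, prefix_len, d_model)

# The prefix is placed in front of the decoder's token embeddings (toy zeros here).
token_embeddings = [[0.0] * d_model for _ in range(3)]
decoder_input = prefix + token_embeddings
```

In this sketch the prefix occupies the first `prefix_len` decoder positions, so the decoder can attend to visual information without any change to its token vocabulary. In the two-stage pipeline described above, the second stage would compute the prefix from a text-encoder embedding instead of `visual_vec`.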
About the journal:
ETRI Journal is an international, peer-reviewed multidisciplinary journal published bimonthly in English. The main focus of the journal is to provide an open forum to exchange innovative ideas and technology in the fields of information, telecommunications, and electronics.
Key topics of interest include high-performance computing, big data analytics, cloud computing, multimedia technology, communication networks and services, wireless communications and mobile computing, material and component technology, as well as security.
With an international editorial committee and experts from around the world as reviewers, ETRI Journal publishes high-quality research papers on the latest and best developments from the global community.