{"title":"基于正则组语义一致性的三模态群引导增量蒸馏多模态神经机器翻译","authors":"Junjun Guo, Yunyue Li, Kaiwen Tan","doi":"10.1016/j.ipm.2025.104149","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating additional information from additional modalities, such as images, video, audio, and potentially other modalities. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges in data scarcity or noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion using these limited multi-modal data, training models from scratch without utilizing prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation in visual scenarios. The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modal group incremental distillation, while further improving translation performance through a regularized group alignment strategy, thereby enhancing machine translation ability in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach shows improvements over the state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at <span><span>https://github.com/lyy-nlp/MMT_main</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 5","pages":"Article 104149"},"PeriodicalIF":6.9000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Triplet-modality group-guided incremental distillation with regularized group semantic consistency for multi-modal neural machine translation\",\"authors\":\"Junjun Guo, Yunyue Li, Kaiwen Tan\",\"doi\":\"10.1016/j.ipm.2025.104149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating additional information from additional modalities, such as images, video, audio, and potentially other modalities. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges in data scarcity or noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion using these limited multi-modal data, training models from scratch without utilizing prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation in visual scenarios. 
The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modal group incremental distillation, while further improving translation performance through a regularized group alignment strategy, thereby enhancing machine translation ability in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach shows improvements over the state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at <span><span>https://github.com/lyy-nlp/MMT_main</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 5\",\"pages\":\"Article 104149\"},\"PeriodicalIF\":6.9000,\"publicationDate\":\"2025-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457325000901\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325000901","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Triplet-modality group-guided incremental distillation with regularized group semantic consistency for multi-modal neural machine translation
Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating complementary information from additional modalities, such as images, video, and audio. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges with data scarcity and noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion over this limited multi-modal data, training models from scratch without exploiting prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation to visual scenarios. The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modality group incremental distillation, and further improves translation performance through a regularized group alignment strategy, thereby enhancing translation quality in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach outperforms state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at https://github.com/lyy-nlp/MMT_main.
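The abstract names two training signals: an incremental distillation term that keeps the multi-modal student close to the frozen pure-text NMT teacher, and a regularized group semantic consistency term that aligns the text, image, and fused representations around a shared group center. Below is a minimal PyTorch sketch of how such a combined objective could look; it is not the authors' released implementation (see the GitHub link above), and all function names, pooling choices, and loss weights (`tau`, `alpha`, `beta`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def incremental_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Temperature-scaled KL distillation from a frozen pure-text NMT teacher."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

def group_alignment_loss(text_emb, image_emb, fused_emb):
    """One plausible reading of 'group-centered multi-modal semantic alignment':
    pull each pooled modality embedding toward the centroid of the triplet."""
    group = torch.stack([text_emb, image_emb, fused_emb], dim=0)  # (3, B, D)
    center = group.mean(dim=0, keepdim=True)                      # (1, B, D)
    # Cosine distance of every modality embedding to the group center.
    return (1.0 - F.cosine_similarity(group, center, dim=-1)).mean()

def total_loss(ce_loss, student_logits, teacher_logits,
               text_emb, image_emb, fused_emb, alpha=0.5, beta=0.1):
    """Hypothetical weighting of the three terms; the paper's actual
    coefficients and schedule may differ."""
    kd = incremental_distillation_loss(student_logits, teacher_logits)
    ga = group_alignment_loss(text_emb, image_emb, fused_emb)
    return ce_loss + alpha * kd + beta * ga

# Toy usage with random tensors (batch=4, vocab=100, embedding dim=16).
B, V, D = 4, 100, 16
loss = total_loss(
    ce_loss=torch.tensor(1.0),
    student_logits=torch.randn(B, V), teacher_logits=torch.randn(B, V),
    text_emb=torch.randn(B, D), image_emb=torch.randn(B, D),
    fused_emb=torch.randn(B, D),
)
```

The group-centering step pulls each modality toward the centroid of all three rather than toward any single anchor modality, which is one way a "group-centered" constraint could avoid privileging text over the visual signal.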
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.