{"title":"基于正则组语义一致性的三模态群引导增量蒸馏多模态神经机器翻译","authors":"Junjun Guo, Yunyue Li, Kaiwen Tan","doi":"10.1016/j.ipm.2025.104149","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating additional information from additional modalities, such as images, video, audio, and potentially other modalities. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges in data scarcity or noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion using these limited multi-modal data, training models from scratch without utilizing prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation in visual scenarios. The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modal group incremental distillation, while further improving translation performance through a regularized group alignment strategy, thereby enhancing machine translation ability in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach shows improvements over the state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at <span><span>https://github.com/lyy-nlp/MMT_main</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 5","pages":"Article 104149"},"PeriodicalIF":6.9000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Triplet-modality group-guided incremental distillation with regularized group semantic consistency for multi-modal neural machine translation\",\"authors\":\"Junjun Guo, Yunyue Li, Kaiwen Tan\",\"doi\":\"10.1016/j.ipm.2025.104149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating additional information from additional modalities, such as images, video, audio, and potentially other modalities. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges in data scarcity or noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion using these limited multi-modal data, training models from scratch without utilizing prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation in visual scenarios. 
The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modal group incremental distillation, while further improving translation performance through a regularized group alignment strategy, thereby enhancing machine translation ability in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach shows improvements over the state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at <span><span>https://github.com/lyy-nlp/MMT_main</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 5\",\"pages\":\"Article 104149\"},\"PeriodicalIF\":6.9000,\"publicationDate\":\"2025-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457325000901\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325000901","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Triplet-modality group-guided incremental distillation with regularized group semantic consistency for multi-modal neural machine translation
Multi-modal Machine Translation (MMT) aims to tackle the challenge of cross-modal semantic alignment by integrating complementary information from additional modalities, such as images, video, and audio. Unfortunately, collecting high-quality multi-modal data pairs is costly, leading to challenges with data scarcity and noise robustness. Most existing MMT research focuses on feature-level cross-modal fusion over this limited multi-modal data, training models from scratch without exploiting prior knowledge from established pure-text neural machine translation (NMT) models. This results in inefficient use of computational resources and cross-modal misalignment. To this end, this paper presents a triplet-modality group-guided incremental distillation approach, constrained by group-centered multi-modal semantic alignment, to extend the scope of machine translation to visual scenarios. The proposed approach preserves the translation capabilities of the pre-trained NMT model through triplet-modality group incremental distillation, and further improves translation performance through a regularized group alignment strategy, thereby enhancing translation quality in MMT. We conducted extensive experiments on two general-domain and two specific-domain MMT tasks. The results demonstrate that the proposed approach outperforms state-of-the-art (SOTA) methods across all test sets, achieving performance gains of over 3.7%. In-depth analysis highlights the effectiveness and robustness of our method in cross-modal alignment and noisy visual scenarios. Our code is available at https://github.com/lyy-nlp/MMT_main.
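The abstract names two training signals: an incremental distillation term that keeps the multi-modal student close to the frozen pure-text NMT teacher, and a regularized group semantic consistency term that aligns the text, image, and fused representations around a shared group center. Below is a minimal PyTorch sketch of how such a combined objective could look; it is not the authors' released implementation (see the GitHub link above), and all function names, pooling choices, and loss weights (`tau`, `alpha`, `beta`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def incremental_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Temperature-scaled KL distillation from a frozen pure-text NMT teacher."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

def group_alignment_loss(text_emb, image_emb, fused_emb):
    """One plausible reading of 'group-centered multi-modal semantic alignment':
    pull each pooled modality embedding toward the centroid of the triplet."""
    group = torch.stack([text_emb, image_emb, fused_emb], dim=0)  # (3, B, D)
    center = group.mean(dim=0, keepdim=True)                      # (1, B, D)
    # Cosine distance of every modality embedding to the group center.
    return (1.0 - F.cosine_similarity(group, center, dim=-1)).mean()

def total_loss(ce_loss, student_logits, teacher_logits,
               text_emb, image_emb, fused_emb, alpha=0.5, beta=0.1):
    """Hypothetical weighting of the three terms; the paper's actual
    coefficients and schedule may differ."""
    kd = incremental_distillation_loss(student_logits, teacher_logits)
    ga = group_alignment_loss(text_emb, image_emb, fused_emb)
    return ce_loss + alpha * kd + beta * ga

# Toy usage with random tensors (batch=4, vocab=100, embedding dim=16).
B, V, D = 4, 100, 16
loss = total_loss(
    ce_loss=torch.tensor(1.0),
    student_logits=torch.randn(B, V), teacher_logits=torch.randn(B, V),
    text_emb=torch.randn(B, D), image_emb=torch.randn(B, D),
    fused_emb=torch.randn(B, D),
)
```

The group-centering step pulls each modality toward the centroid of all three rather than toward any single anchor modality, which is one way a "group-centered" constraint could avoid privileging text over the visual signal.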
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.