Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu
{"title":"基于多模态语义蒸馏的视觉语言模型跨语言自适应","authors":"Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu","doi":"10.1109/TMM.2025.3557678","DOIUrl":null,"url":null,"abstract":"Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3184-3196"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation\",\"authors\":\"Yu Weng;Wenbin He;Jun Dong;Chaomurilige;Xuan Liu;Zheng Liu\",\"doi\":\"10.1109/TMM.2025.3557678\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. 
Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"3184-3196\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948343/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948343/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Cross-Lingual Adaptation for Vision-Language Model via Multimodal Semantic Distillation
Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.
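To make the two components described above more concrete, the following is a minimal, hypothetical PyTorch sketch of what a syntax-aware adapter and a multimodal semantic distillation loss could look like. The class and function names, layer shapes, and loss formulation are illustrative assumptions based only on this abstract, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxAwareAdapter(nn.Module):
    """Hypothetical lightweight adapter: re-mixes multilingual token features so
    they better match the English-like syntax the frozen VLM was trained on."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Self-attention lets every token attend to the full sentence,
        # approximating a soft reordering toward English word order.
        self.reorder_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, hidden)
        reordered, _ = self.reorder_attn(x, x, x)
        h = self.up(F.gelu(self.down(reordered)))
        return self.norm(x + h)  # residual keeps the original representation intact

def multimodal_semantic_distillation_loss(student_text, teacher_text,
                                          student_img_sim, teacher_img_sim,
                                          temperature: float = 2.0):
    """Hypothetical distillation objective: the target-language student mimics
    (1) the English teacher's text-sequence features and (2) its image-text
    association structure, so cross-modal alignments carry over to new languages."""
    # Token-level feature matching against the frozen English-trained teacher.
    seq_loss = F.mse_loss(student_text, teacher_text.detach())
    # Match the soft image-text similarity distribution via KL divergence.
    p_teacher = F.softmax(teacher_img_sim.detach() / temperature, dim=-1)
    log_p_student = F.log_softmax(student_img_sim / temperature, dim=-1)
    assoc_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    return seq_loss + assoc_loss

In such a setup, the frozen English-trained VLM would produce teacher_text and teacher_img_sim from the English side of a small translation pair, while the adapter-augmented student processes the target-language side; only the adapter parameters would be updated, which is consistent with the lightweight, low-cost adaptation the abstract emphasizes.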
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.