Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang
{"title":"从基于图像的大型多模态模型到视频任务的有效转换","authors":"Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang","doi":"10.1109/TMM.2025.3557692","DOIUrl":null,"url":null,"abstract":"Extending image-based Large Multimodal Models (LMMs) to video-based LMMs always requires temporal modeling in the pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from various image-text-based scenarios, leading to degradation in some downstream tasks. To address this issue, in this paper, we introduce a novel, efficient transfer approach termed MTransLLAMA, which employs transfer learning from pre-trained image LMMs for fine-grained video tasks with only small-scale training sets. Our method enables <bold>fewer trainable parameters</b> and achieves <bold>faster adaptation</b> and <bold>higher accuracy</b> than pre-training video-based LMM models. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attentions for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attentions. Experiments demonstrate that across multiple datasets and tasks, without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3045-3056"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Transfer From Image-Based Large Multimodal Models to Video Tasks\",\"authors\":\"Shidong Cao;Zhonghan Zhao;Shengyu Hao;Wenhao Chai;Jenq-Neng Hwang;Hongwei Wang;Gaoang Wang\",\"doi\":\"10.1109/TMM.2025.3557692\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extending image-based Large Multimodal Models (LMMs) to video-based LMMs always requires temporal modeling in the pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from various image-text-based scenarios, leading to degradation in some downstream tasks. To address this issue, in this paper, we introduce a novel, efficient transfer approach termed MTransLLAMA, which employs transfer learning from pre-trained image LMMs for fine-grained video tasks with only small-scale training sets. Our method enables <bold>fewer trainable parameters</b> and achieves <bold>faster adaptation</b> and <bold>higher accuracy</b> than pre-training video-based LMM models. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attentions for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attentions. 
Experiments demonstrate that across multiple datasets and tasks, without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"3045-3056\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948327/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948327/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Efficient Transfer From Image-Based Large Multimodal Models to Video Tasks
Extending image-based Large Multimodal Models (LMMs) to video-based LMMs typically requires temporal modeling during pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from diverse image-text scenarios, leading to degradation on some downstream tasks. To address this issue, we introduce a novel, efficient transfer approach termed MTransLLAMA, which applies transfer learning from pre-trained image LMMs to fine-grained video tasks using only small-scale training sets. Our method requires fewer trainable parameters and achieves faster adaptation and higher accuracy than pre-training video-based LMMs. Specifically, it adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attention for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attention. Experiments demonstrate that, across multiple datasets and tasks and without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.
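The abstract describes reusing spatial attention weights in temporal attention so that no separate temporal module has to be pre-trained. The paper's exact mechanism is not spelled out here; the sketch below is only a minimal, hypothetical illustration of one plausible reading, namely sharing the spatial attention's projection weights with a temporal attention pass over frames. All class names, shapes, and hyperparameters are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch only -- NOT the MTransLLAMA implementation. It assumes a
# weight-sharing reading of "reusing spatial attention weights in temporal
# attention": the q/k/v projections learned for spatial self-attention are
# applied again along the time axis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpatialTemporalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One set of projections, reused for both the spatial and temporal passes.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> standard multi-head self-attention.
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (b, heads, n, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)  # (b, heads, n, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, spatial_tokens, dim)
        b, t, n, d = x.shape
        # Spatial pass: attend over tokens within each frame.
        x = x + self._attend(x.reshape(b * t, n, d)).reshape(b, t, n, d)
        # Temporal pass: reuse the same projections, now attending over frames
        # at each spatial location (cyclical spatial-temporal reasoning).
        x_t = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        x_t = x_t + self._attend(x_t)
        return x_t.reshape(b, n, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    frames = torch.randn(2, 8, 196, 768)   # (batch, frames, tokens, dim)
    print(SharedSpatialTemporalAttention()(frames).shape)  # torch.Size([2, 8, 196, 768])
```

Because the temporal pass introduces no new projection parameters in this sketch, adaptation from an image-pretrained backbone stays lightweight, which is consistent with the abstract's claim of fewer trainable parameters; the early-fusion and dynamic attention routing components mentioned in the abstract are not depicted here.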
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.