Dan Liu, Zhouli Shen, Ai Peng, Zhiyuan Ma, Jinpeng Mi, Mao Ye, Jianwei Zhang
{"title":"JSS-CLIP:增强图像到视频的学习与拼图侧网络","authors":"Dan Liu , Zhouli Shen , Ai Peng , Zhiyuan Ma , Jinpeng Mi , Mao Ye , Jianwei Zhang","doi":"10.1016/j.inffus.2025.103775","DOIUrl":null,"url":null,"abstract":"<div><div>Large pre-trained vision-language models, such as CLIP, have achieved remarkable success in computer vision. However, the challenge of extending image-based models to video understanding through effective temporal modeling remains an open problem. Although recent studies have shifted their focus towards image-to-video transfer learning, the majority of existing methods overlook algorithm efficiency when adapting large models to the video domain. In this paper, we propose an innovative JigSaw Side network, JSS-CLIP, aiming to balance the algorithm efficiency and spatiotemporal modeling performance for video action recognition. Specifically, we introduce lightweight side networks attached to the frozen vision model, which avoids the backpropagation through the computationally intensive pre-trained model, thereby significantly reducing computational costs. Additionally, we design an implicit alignment module to guide the generation of hierarchical spatiotemporal JigSaw feature maps. These feature maps encapsulate rich motion information and action cues within videos, facilitating a comprehensive understanding of dynamic content. We conduct extensive experiments on three large-scale action datasets, whose results consistently demonstrate the competitiveness of JSS-CLIP in terms of efficiency and performance. The source code will be released at https://github.com/liarshen/JSS-CLIP.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103775"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"JSS-CLIP: Boosting image-to-video transfer learning with JigSaw side network\",\"authors\":\"Dan Liu , Zhouli Shen , Ai Peng , Zhiyuan Ma , Jinpeng Mi , Mao Ye , Jianwei Zhang\",\"doi\":\"10.1016/j.inffus.2025.103775\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large pre-trained vision-language models, such as CLIP, have achieved remarkable success in computer vision. However, the challenge of extending image-based models to video understanding through effective temporal modeling remains an open problem. Although recent studies have shifted their focus towards image-to-video transfer learning, the majority of existing methods overlook algorithm efficiency when adapting large models to the video domain. In this paper, we propose an innovative JigSaw Side network, JSS-CLIP, aiming to balance the algorithm efficiency and spatiotemporal modeling performance for video action recognition. Specifically, we introduce lightweight side networks attached to the frozen vision model, which avoids the backpropagation through the computationally intensive pre-trained model, thereby significantly reducing computational costs. Additionally, we design an implicit alignment module to guide the generation of hierarchical spatiotemporal JigSaw feature maps. These feature maps encapsulate rich motion information and action cues within videos, facilitating a comprehensive understanding of dynamic content. We conduct extensive experiments on three large-scale action datasets, whose results consistently demonstrate the competitiveness of JSS-CLIP in terms of efficiency and performance. 
The source code will be released at https://github.com/liarshen/JSS-CLIP.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103775\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008371\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008371","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
JSS-CLIP: Boosting image-to-video transfer learning with JigSaw side network
Large pre-trained vision-language models, such as CLIP, have achieved remarkable success in computer vision. However, extending image-based models to video understanding through effective temporal modeling remains an open problem. Although recent studies have shifted their focus towards image-to-video transfer learning, most existing methods overlook algorithm efficiency when adapting large models to the video domain. In this paper, we propose JSS-CLIP, an innovative JigSaw Side network that balances algorithm efficiency with spatiotemporal modeling performance for video action recognition. Specifically, we introduce lightweight side networks attached to the frozen vision model, which avoids backpropagation through the computationally intensive pre-trained model and thereby significantly reduces computational cost. Additionally, we design an implicit alignment module to guide the generation of hierarchical spatiotemporal JigSaw feature maps. These feature maps encapsulate rich motion information and action cues within videos, facilitating a comprehensive understanding of dynamic content. Extensive experiments on three large-scale action datasets consistently demonstrate the competitiveness of JSS-CLIP in terms of both efficiency and performance. The source code will be released at https://github.com/liarshen/JSS-CLIP.
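Since the paper's code has not yet been released, the following is a minimal PyTorch sketch of the general side-network idea the abstract describes: a frozen, pre-trained vision encoder feeds a small trainable side stream, and detaching the backbone features keeps backpropagation confined to the lightweight side path. All module and method names here (including the backbone hook `stage_features`) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SideBlock(nn.Module):
    """Lightweight bottleneck block fusing a frozen backbone feature into the side stream."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, side: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # feat comes from the frozen backbone; detach() ensures no gradient
        # ever flows back into the expensive pre-trained model.
        return side + self.fuse(side + feat.detach())


class FrozenBackboneWithSide(nn.Module):
    """Frozen CLIP-style vision encoder plus a trainable side network and head."""

    def __init__(self, backbone: nn.Module, dim: int, num_stages: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained vision model
        self.blocks = nn.ModuleList([SideBlock(dim) for _ in range(num_stages)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # `stage_features` is a hypothetical hook returning one token map of
        # shape (batch, tokens, dim) per encoder stage.
        with torch.no_grad():
            feats = self.backbone.stage_features(frames)
        side = torch.zeros_like(feats[0])
        for block, feat in zip(self.blocks, feats):
            side = block(side, feat)
        return self.head(side.mean(dim=1))  # pool tokens, then classify
```

Under this sketch, only the side blocks and the head receive gradients, so an optimizer would be built from `filter(lambda p: p.requires_grad, model.parameters())`; that is where the claimed reduction in training cost comes from.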
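The abstract names "hierarchical spatiotemporal JigSaw feature maps" without defining their construction. As a purely speculative illustration of the flavor of such features, the sketch below mixes patch tokens across frames so that each frame-level feature map carries motion cues from neighboring time steps; every detail (the chunking, the random temporal rolls) is an assumption, not the paper's method.

```python
from typing import Optional

import torch


def spatiotemporal_jigsaw(tokens: torch.Tensor, num_pieces: int = 4,
                          generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Hypothetical jigsaw mixing for tokens of shape (batch, frames, patches, dim).

    The patch axis is split into `num_pieces` chunks, and each chunk is rolled
    along the time axis by a random offset, so every frame's feature map ends
    up containing patches drawn from several different frames.
    """
    _, t, _, _ = tokens.shape
    chunks = tokens.chunk(num_pieces, dim=2)
    # One random temporal shift per jigsaw piece.
    shifts = torch.randint(0, t, (len(chunks),), generator=generator)
    mixed = [chunk.roll(int(shift), dims=1) for chunk, shift in zip(chunks, shifts)]
    return torch.cat(mixed, dim=2)
```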
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, and fosters collaboration among the diverse disciplines driving the field's progress. It is the leading outlet for sharing research and development in this area, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.