Clip4Vis: Parameter-free fusion for multimodal video recognition
Qishi Zheng, Mengnan He, Jiuqin Duan, Gai Luo, Pengcheng Wu, Yimin Han, Qingyue Min, Peng Chen, Ping Zhang
Neurocomputing, Volume 652, Article 131046 (published 2025-07-18). DOI: 10.1016/j.neucom.2025.131046
Abstract
Multimodal video recognition has become a central research focus because it can effectively integrate information from diverse modalities such as video and text. However, traditional fusion methods typically rely on trainable parameters, which increases computational cost. To address this, the paper presents Clip4Vis, a zero-parameter progressive fusion framework that combines video and text features in a shallow-to-deep manner. The shallow and deep fusion steps are implemented by two key modules: (i) Cross-Model Attention, which enhances video embeddings with textual information, enabling adaptive focus on keyframes and improving the action representation of the video; and (ii) Joint Temporal-Textual Aggregation, which integrates video embeddings and word embeddings by jointly exploiting temporal and textual information, enabling global information aggregation. Extensive evaluations on five widely used video datasets show that the method achieves competitive performance in general, zero-shot, and few-shot video recognition. The best model, built on the released CLIP model, reaches a state-of-the-art accuracy of 87.4% for general recognition on Kinetics-400 and 75.3% for zero-shot recognition on Kinetics-600. The code will be released later.
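Since the abstract only sketches the two fusion modules, the snippet below is a minimal, hedged PyTorch sketch of what a parameter-free cross-modal fusion step could look like under one plausible reading: all weights come from cosine similarity between CLIP-style video and text embeddings, so no trainable fusion parameters are introduced. The function names, the residual re-weighting, and the simple averaging are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of parameter-free video-text fusion with CLIP-style embeddings.
# Everything here (names, residual scheme, pooling choices) is an assumption
# for illustration, not the paper's actual code.
import torch
import torch.nn.functional as F


def cross_modal_attention(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Re-weight per-frame video embeddings by their similarity to the text.

    video_emb: (T, D) frame embeddings; text_emb: (D,) sentence embedding.
    No trainable parameters: attention weights come from cosine similarity alone.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    scores = v @ t                       # (T,) frame-to-text similarity
    weights = scores.softmax(dim=0)      # adaptive focus on keyframes
    return video_emb + weights.unsqueeze(-1) * video_emb  # text-enhanced frames


def joint_temporal_textual_aggregation(video_emb: torch.Tensor,
                                       word_emb: torch.Tensor) -> torch.Tensor:
    """Pool frames and word tokens into one global clip-level representation.

    video_emb: (T, D) enhanced frame embeddings; word_emb: (N, D) token embeddings.
    """
    frame_global = video_emb.mean(dim=0)                       # temporal aggregation
    sims = F.normalize(word_emb, dim=-1) @ F.normalize(frame_global, dim=0)
    text_global = (sims.softmax(dim=0).unsqueeze(-1) * word_emb).sum(dim=0)
    return 0.5 * (frame_global + text_global)                  # fused video-level feature


# Usage: fuse 8 frame embeddings with a 77-token caption encoding (CLIP ViT-B dims).
frames = torch.randn(8, 512)
tokens = torch.randn(77, 512)
enhanced = cross_modal_attention(frames, tokens.mean(dim=0))
clip_feature = joint_temporal_textual_aggregation(enhanced, tokens)
```

Because both steps reduce to normalized dot products, softmax, and pooling, the fusion adds no learnable weights on top of the frozen CLIP encoders, which is the property the abstract emphasizes.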
Journal Introduction
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.