Action-Prompt: A unified visual prompt and fusion network for enhanced video action recognition

Linxi Li, Mingwei Tang, Shiqi Qing, Yanxi Zheng, Jie Hu, Mingfeng Zhao, Si Chen

Knowledge-Based Systems, Volume 318, Article 113547. Published 2025-04-24. DOI: 10.1016/j.knosys.2025.113547
Abstract
Video action recognition is a crucial task in video understanding and has garnered significant attention from researchers. However, while most existing methods exploit spatio-temporal and motion features for action recognition, they overlook that a naive fusion of these different features is not fully adapted to the task. To address this issue, we design a prompt block named the Prompt Learning Layer (PLL), a plug-and-play module that can be inserted into a backbone to learn visual prompts for action recognition. Additionally, we propose the Spatio-Temporal and Motion Fusion Module (STMF), which uses novel extraction and fusion strategies to enhance the complementarity between the different features. The STMF comprises two main components: the Bidirectional Motion Difference Module (BiMDM), which handles bidirectional motion features, and the Spatio-Temporal Adaptive Module (STAM), which handles spatio-temporal features in an adaptive manner. Finally, experimental results demonstrate that our proposed method achieves state-of-the-art performance on the Kinetics-400 and Something-Something V1 and V2 datasets.
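The BiMDM described in the abstract extracts motion cues in both temporal directions. As a rough, hypothetical illustration of the underlying idea only (the authors' module operates on deep feature maps inside the network, not raw pixels, and all names below are invented), forward and backward frame differences over a clip can be sketched as:

```python
# Hypothetical sketch of bidirectional motion differences, in the spirit of
# the paper's BiMDM; illustration only, not the authors' implementation.
# Frames are flat lists of pixel intensities.

def motion_differences(frames):
    """Given a list of equal-length frames (lists of floats), return
    (forward, backward) per-frame differences. The last forward and the
    first backward entry are zero-padded so the temporal length T of the
    input is preserved."""
    T, n = len(frames), len(frames[0])
    zero = [0.0] * n
    forward = [
        [b - a for a, b in zip(frames[t], frames[t + 1])]  # motion t -> t+1
        for t in range(T - 1)
    ] + [zero]
    backward = [zero] + [
        [p - c for c, p in zip(frames[t], frames[t - 1])]  # motion t -> t-1
        for t in range(1, T)
    ]
    return forward, backward

# Toy "video": four one-pixel frames with steadily increasing brightness.
frames = [[0.0], [1.0], [2.0], [3.0]]
fwd, bwd = motion_differences(frames)
print(fwd)  # [[1.0], [1.0], [1.0], [0.0]]
print(bwd)  # [[0.0], [-1.0], [-1.0], [-1.0]]
```

The two difference streams carry complementary directional information; in the paper, such motion features are then fused with spatio-temporal features by the STMF rather than used directly.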
Journal Introduction
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computational techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.