VC-Mamba: Causal Mamba representation consistency for video implicit understanding
Yishan Hu, Jun Zhao, Chen Qi, Yan Qiang, Juanjuan Zhao, Bo Pei
Knowledge-Based Systems, Volume 317, Article 113437. Published 2025-04-07. DOI: 10.1016/j.knosys.2025.113437
Abstract
Recently, spatiotemporal representation learning based on deep learning has driven the advancement of video understanding. However, existing methods based on convolutional neural networks (CNNs) and Transformers still face limitations in understanding implicit information in complex scenes, particularly in capturing dynamic changes over long-range spatiotemporal data and inferring hidden contextual information in videos. To address these challenges, we propose VC-Mamba, a video implicit understanding model based on causal Mamba representation consistency. We segment explicit texture information into token features and leverage the linear Mamba framework to capture long-range spatiotemporal interactions, introducing a spatiotemporal motion Mamba block for motion perception. This block comprises a multi-head temporal length Mamba that enhances cross-frame motion consistency and a bidirectional gated space Mamba that captures the inter-frame dependencies of feature tokens. By analyzing both explicit and implicit spatiotemporal interactions, VC-Mamba effectively captures long-range spatiotemporal representations. Additionally, we design an attention mask perturbation strategy based on causal invariance constraints to optimize the existing selective spatiotemporal mask mechanism. By progressively strengthening the causal weight of relevant features, this strategy traces implicit causal chains in videos, improving the model's resistance to interference from weakly causal features and enhancing the robustness and stability of implicit information understanding. Finally, we conducted extensive experiments on several datasets spanning short-term action recognition and long-term video reasoning tasks. The results demonstrate that VC-Mamba matches or surpasses state-of-the-art models, particularly in capturing long-range spatiotemporal interactions and performing causal reasoning, confirming its effectiveness and generalization in video implicit understanding tasks.
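The two ingredients the abstract names, a bidirectional gated space Mamba over frame tokens and an attention mask perturbation that weakens weakly causal features, can be illustrated with a minimal, self-contained PyTorch sketch. This is not the authors' implementation: the simplified first-order gated recurrence below stands in for a full selective state-space (Mamba) layer, and all module names, score inputs, and hyperparameters (keep_ratio, drop_prob) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BidirectionalGatedSSM(nn.Module):
    """Simplified bidirectional gated linear recurrence over a sequence of frame tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)        # value branch and gate branch
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel recurrence decay (logit)
        self.out_proj = nn.Linear(2 * dim, dim)

    def _scan(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); first-order recurrence h_t = a * h_{t-1} + (1 - a) * x_t
        a = torch.sigmoid(self.decay)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            h = a * h + (1.0 - a) * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v, g = self.in_proj(x).chunk(2, dim=-1)
        fwd = self._scan(v)                  # left-to-right pass over tokens
        bwd = self._scan(v.flip(1)).flip(1)  # right-to-left pass over tokens
        y = self.out_proj(torch.cat([fwd, bwd], dim=-1)) * torch.sigmoid(g)
        return x + y                         # residual connection


def perturb_token_mask(scores: torch.Tensor, keep_ratio: float = 0.7,
                       drop_prob: float = 0.2) -> torch.Tensor:
    """Keep the top-`keep_ratio` tokens by importance score and randomly drop a fraction
    of the remaining, weakly relevant tokens. scores: (batch, tokens); returns a 0/1 mask."""
    n = scores.size(1)
    k = max(1, int(keep_ratio * n))
    top_idx = scores.topk(k, dim=-1).indices
    strong = torch.zeros_like(scores)
    strong.scatter_(1, top_idx, 1.0)              # strongly relevant tokens always kept
    survive = (torch.rand_like(scores) > drop_prob).float()
    return strong + (1.0 - strong) * survive      # weakly relevant tokens randomly perturbed


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)               # (batch, frame tokens, channels)
    block = BidirectionalGatedSSM(64)
    out = block(tokens)                           # -> (2, 16, 64)
    mask = perturb_token_mask(torch.rand(2, 16))  # -> (2, 16) binary mask
    print(out.shape, mask.shape)
```

The forward and backward scans play the role of the bidirectional pass over spatial tokens, while the mask function mimics the idea of keeping strongly causal tokens and stochastically suppressing the rest; a faithful reproduction would replace the scan with a selective state-space layer and derive the scores from the model's own attention or gating signals.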
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.