VC-Mamba: Causal Mamba representation consistency for video implicit understanding

Impact Factor: 7.2 · JCR Q1, Computer Science, Artificial Intelligence (Region 1)
Yishan Hu, Jun Zhao, Chen Qi, Yan Qiang, Juanjuan Zhao, Bo Pei
DOI: 10.1016/j.knosys.2025.113437
Journal: Knowledge-Based Systems, Volume 317, Article 113437
Published: 2025-04-07
URL: https://www.sciencedirect.com/science/article/pii/S0950705125004848
Citations: 0

Abstract

Recently, spatiotemporal representation learning based on deep learning has driven the advancement of video understanding. However, existing methods based on convolutional neural networks (CNNs) and Transformers still face limitations in understanding implicit information in complex scenes, particularly in capturing dynamic changes over long-range spatiotemporal data and inferring hidden contextual information in videos. To address these challenges, we propose VC-Mamba, a video implicit understanding model based on causal Mamba representation consistency. By segmenting explicit texture information into token features and leveraging the linear Mamba framework to capture long-range spatiotemporal interactions, we introduce the spatiotemporal motion Mamba block for motion perception. This block includes a multi-head temporal length Mamba to enhance cross-frame motion consistency and a bidirectional gated space Mamba to capture the inter-frame dependencies of feature tokens. Through the analysis of both explicit and implicit spatiotemporal interactions, VC-Mamba effectively captures long-range spatiotemporal representations. Additionally, we design an attention mask perturbation strategy based on causal invariance constraints to optimize the existing selective spatiotemporal mask mechanism. By progressively enhancing the causal strength of related features, this strategy analyzes implicit causal chains in videos, improving the model’s resistance to interference from weakly causal features and enhancing the robustness and stability of implicit information understanding. Finally, we conducted extensive experiments on several datasets, including short-term action recognition and long-term video reasoning tasks. The results demonstrate that VC-Mamba matches or surpasses state-of-the-art models, particularly in capturing long-range spatiotemporal interactions and causal reasoning, proving its effectiveness and generalization in video implicit understanding tasks.
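The bidirectional gated space Mamba described above can be pictured as two input-gated linear scans run in opposite temporal directions and fused by a gate. The sketch below is a hypothetical toy analogue, not the authors' implementation: the scan recurrence, the sigmoid fusion gate, the decay values, and the function names are all illustrative assumptions.

```python
import math

def selective_scan(xs, decays):
    """Toy 1-D selective state-space scan: h_t = a_t*h_{t-1} + (1-a_t)*x_t,
    where the decay a_t is input-dependent (schematic of Mamba's selectivity)."""
    h, out = 0.0, []
    for x, a in zip(xs, decays):
        h = a * h + (1.0 - a) * x
        out.append(h)
    return out

def bidirectional_gated_scan(xs, decays):
    """Fuse forward and backward scans with a per-token sigmoid gate --
    a rough, hypothetical analogue of a bidirectional gated space block."""
    fwd = selective_scan(xs, decays)
    bwd = selective_scan(xs[::-1], decays[::-1])[::-1]
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    return [sig(x) * f + (1.0 - sig(x)) * b for x, f, b in zip(xs, fwd, bwd)]

# Example: four feature tokens with illustrative input-dependent decays.
tokens = [0.2, 1.0, -0.5, 0.8]
decays = [0.9, 0.5, 0.7, 0.3]
print(bidirectional_gated_scan(tokens, decays))
```

Because both scans are linear recurrences, the cost stays linear in sequence length, which is the property the abstract credits for capturing long-range spatiotemporal interactions.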
Source Journal
Knowledge-Based Systems (Engineering/Technology – Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles per year: 1245
Review time: 7.8 months
Journal Introduction: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial-intelligence-technique-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.