Long Liu, Jianjun Li, Guang Li, Yunfeng Zhai, Ming Zhang
{"title":"VadCLIP++: Dynamic vision-language model for weakly supervised video anomaly detection","authors":"Long Liu , Jianjun Li , Guang Li , Yunfeng Zhai , Ming Zhang","doi":"10.1016/j.dsp.2025.105560","DOIUrl":null,"url":null,"abstract":"<div><div>In the realm of weakly supervised video anomaly detection (WSVAD), the integration of Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, highlighting that learning through textual prompts can effectively distinguish between anomalous events and enhance the expression of visual features. However, existing CLIP models in video anomaly detection tasks rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors, which limits the understanding of dynamic anomalous behaviors. To address this, this paper proposes a dynamic learnable text prompting mechanism, which supervises and learns the frame difference features of adjacent consecutive frames in the video to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model precisely focuses on anomalous behaviors within individual frames, making the textual prompts in different states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and TOP-k selection mechanisms to enhance the model's generalization ability in anomalous regions of video frames, while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch optimizes in-frame anomaly localization under the influence of the SFSM module. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. 
Experimental results show that on the XD-Violence and UCF-Crime datasets, the proposed method achieves 85.03% AP and 88.12% AUC, thoroughly validating its effectiveness in anomaly detection tasks.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"168 ","pages":"Article 105560"},"PeriodicalIF":3.0000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200425005822","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
In weakly supervised video anomaly detection (WSVAD), integrating Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, showing that learning through textual prompts can effectively distinguish anomalous events and enrich the expression of visual features. However, existing CLIP-based models for video anomaly detection rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors and thus limiting their understanding of dynamic anomalies. To address this limitation, this paper proposes a dynamic learnable text prompting mechanism that supervises and learns frame-difference features between adjacent consecutive frames to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model focuses precisely on anomalous behaviors within individual frames, making the textual prompts in both states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and top-k selection to enhance the model's generalization to anomalous regions of video frames while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch, guided by the SFSM, optimizes in-frame anomaly localization. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. Experimental results show that the proposed method achieves 85.03% AP on the XD-Violence dataset and 88.12% AUC on the UCF-Crime dataset, validating its effectiveness in anomaly detection tasks.
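The two core mechanisms described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `frame_difference_features` and `sfsm_select` are hypothetical names, the inputs are plain Python lists rather than the CLIP feature tensors a real VadCLIP-style model would use, and the SFSM here is reduced to its stated ingredients of random sampling followed by top-k selection.

```python
import random

def frame_difference_features(frames):
    """Inter-frame difference features for the dynamic-prompt branch:
    the element-wise difference between each pair of adjacent frame
    feature vectors (a stand-in for CLIP frame embeddings)."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(frames, frames[1:])]

def sfsm_select(region_scores, k, sample_ratio=0.75, seed=0):
    """Toy SFSM: randomly sample a subset of spatial regions, then keep
    the indices of the top-k sampled regions by anomaly score."""
    rng = random.Random(seed)
    n = len(region_scores)
    # Sample at least k regions so top-k is always well defined.
    sampled = rng.sample(range(n), max(k, int(n * sample_ratio)))
    return sorted(sampled, key=lambda i: region_scores[i], reverse=True)[:k]
```

In a full model, the difference features would supervise the dynamic learnable prompts, and the selected region indices would gate which spatial features the static-prompt branch attends to; random sampling during training is what the paper credits with improving generalization.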
Journal description:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• cheminformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy