Long Liu, Jianjun Li, Guang Li, Yunfeng Zhai, Ming Zhang
{"title":"VadCLIP++: Dynamic vision-language model for weakly supervised video anomaly detection","authors":"Long Liu , Jianjun Li , Guang Li , Yunfeng Zhai , Ming Zhang","doi":"10.1016/j.dsp.2025.105560","DOIUrl":null,"url":null,"abstract":"<div><div>In the realm of weakly supervised video anomaly detection (WSVAD), the integration of Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, highlighting that learning through textual prompts can effectively distinguish between anomalous events and enhance the expression of visual features. However, existing CLIP models in video anomaly detection tasks rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors, which limits the understanding of dynamic anomalous behaviors. To address this, this paper proposes a dynamic learnable text prompting mechanism, which supervises and learns the frame difference features of adjacent consecutive frames in the video to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model precisely focuses on anomalous behaviors within individual frames, making the textual prompts in different states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and TOP-k selection mechanisms to enhance the model's generalization ability in anomalous regions of video frames, while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch optimizes in-frame anomaly localization under the influence of the SFSM module. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. 
Experimental results show that on the XD-Violence and UCF-Crime datasets, the proposed method achieves 85.03% AP and 88.12% AUC, thoroughly validating its effectiveness in anomaly detection tasks.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"168 ","pages":"Article 105560"},"PeriodicalIF":3.0000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200425005822","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
In weakly supervised video anomaly detection (WSVAD), integrating Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, showing that learning through textual prompts can effectively distinguish anomalous events and enrich the expression of visual features. However, existing CLIP-based models for video anomaly detection rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors and thus limiting their understanding of dynamic anomalies. To address this limitation, this paper proposes a dynamic learnable text prompting mechanism that supervises and learns frame-difference features between adjacent consecutive frames to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model focuses precisely on anomalous behaviors within individual frames, making the textual prompts in both states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and top-k selection to enhance the model's generalization to anomalous regions of video frames while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch, guided by the SFSM, optimizes in-frame anomaly localization. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. Experimental results show that the proposed method achieves 85.03% AP on the XD-Violence dataset and 88.12% AUC on the UCF-Crime dataset, validating its effectiveness in anomaly detection tasks.
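The two core mechanisms described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `frame_difference_features` and `sfsm_select` are hypothetical names, the inputs are plain Python lists rather than the CLIP feature tensors a real VadCLIP-style model would use, and the SFSM here is reduced to its stated ingredients of random sampling followed by top-k selection.

```python
import random

def frame_difference_features(frames):
    """Inter-frame difference features for the dynamic-prompt branch:
    the element-wise difference between each pair of adjacent frame
    feature vectors (a stand-in for CLIP frame embeddings)."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(frames, frames[1:])]

def sfsm_select(region_scores, k, sample_ratio=0.75, seed=0):
    """Toy SFSM: randomly sample a subset of spatial regions, then keep
    the indices of the top-k sampled regions by anomaly score."""
    rng = random.Random(seed)
    n = len(region_scores)
    # Sample at least k regions so top-k is always well defined.
    sampled = rng.sample(range(n), max(k, int(n * sample_ratio)))
    return sorted(sampled, key=lambda i: region_scores[i], reverse=True)[:k]
```

In a full model, the difference features would supervise the dynamic learnable prompts, and the selected region indices would gate which spatial features the static-prompt branch attends to; random sampling during training is what the paper credits with improving generalization.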
Journal description:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• cheminformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy