VadCLIP++: Dynamic vision-language model for weakly supervised video anomaly detection

IF 3.0 · CAS Zone 3 (Engineering & Technology) · Q2 ENGINEERING, ELECTRICAL & ELECTRONIC
Long Liu, Jianjun Li, Guang Li, Yunfeng Zhai, Ming Zhang
{"title":"弱监督视频异常检测的动态视觉语言模型","authors":"Long Liu ,&nbsp;Jianjun Li ,&nbsp;Guang Li ,&nbsp;Yunfeng Zhai ,&nbsp;Ming Zhang","doi":"10.1016/j.dsp.2025.105560","DOIUrl":null,"url":null,"abstract":"<div><div>In the realm of weakly supervised video anomaly detection (WSVAD), the integration of Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, highlighting that learning through textual prompts can effectively distinguish between anomalous events and enhance the expression of visual features. However, existing CLIP models in video anomaly detection tasks rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors, which limits the understanding of dynamic anomalous behaviors. To address this, this paper proposes a dynamic learnable text prompting mechanism, which supervises and learns the frame difference features of adjacent consecutive frames in the video to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model precisely focuses on anomalous behaviors within individual frames, making the textual prompts in different states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and TOP-k selection mechanisms to enhance the model's generalization ability in anomalous regions of video frames, while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch optimizes in-frame anomaly localization under the influence of the SFSM module. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. Experimental results show that on the XD-Violence and UCF-Crime datasets, the proposed method achieves 85.03% AP and 88.12% AUC, thoroughly validating its effectiveness in anomaly detection tasks.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"168 ","pages":"Article 105560"},"PeriodicalIF":3.0000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VadCLIP++: Dynamic vision-language model for weakly supervised video anomaly detection\",\"authors\":\"Long Liu ,&nbsp;Jianjun Li ,&nbsp;Guang Li ,&nbsp;Yunfeng Zhai ,&nbsp;Ming Zhang\",\"doi\":\"10.1016/j.dsp.2025.105560\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In the realm of weakly supervised video anomaly detection (WSVAD), the integration of Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, highlighting that learning through textual prompts can effectively distinguish between anomalous events and enhance the expression of visual features. However, existing CLIP models in video anomaly detection tasks rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors, which limits the understanding of dynamic anomalous behaviors. To address this, this paper proposes a dynamic learnable text prompting mechanism, which supervises and learns the frame difference features of adjacent consecutive frames in the video to capture the dynamic changes of anomalous behaviors. 
At the same time, by incorporating static textual prompts, the model precisely focuses on anomalous behaviors within individual frames, making the textual prompts in different states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and TOP-k selection mechanisms to enhance the model's generalization ability in anomalous regions of video frames, while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch optimizes in-frame anomaly localization under the influence of the SFSM module. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. Experimental results show that on the XD-Violence and UCF-Crime datasets, the proposed method achieves 85.03% AP and 88.12% AUC, thoroughly validating its effectiveness in anomaly detection tasks.</div></div>\",\"PeriodicalId\":51011,\"journal\":{\"name\":\"Digital Signal Processing\",\"volume\":\"168 \",\"pages\":\"Article 105560\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2025-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Digital Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1051200425005822\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1051200425005822","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

In the realm of weakly supervised video anomaly detection (WSVAD), the integration of Contrastive Language-Image Pre-training (CLIP) models has demonstrated substantial benefits, highlighting that learning through textual prompts can effectively distinguish anomalous events from normal ones and enhance the expression of visual features. However, existing CLIP models in video anomaly detection tasks rely solely on static textual prompts, neglecting the temporal continuity of anomalous behaviors, which limits the understanding of dynamic anomalous behaviors. To address this limitation, this paper proposes a dynamic learnable text prompting mechanism, which supervises and learns the frame-difference features of adjacent consecutive frames in the video to capture the dynamic changes of anomalous behaviors. At the same time, by incorporating static textual prompts, the model precisely focuses on anomalous behaviors within individual frames, making the textual prompts in different states more sensitive to the temporal and spatial action details of the video. In addition, a Spatial Feature Selection Module (SFSM) is proposed, which leverages random sampling and top-k selection mechanisms to enhance the model's generalization ability in anomalous regions of video frames, while modeling the spatial relationships of the global context. The dynamic learnable text prompt branch handles temporal anomalies by extracting inter-frame difference features, while the static text prompt branch optimizes in-frame anomaly localization under the influence of the SFSM module. The dual-branch collaboration establishes complementary spatiotemporal representations, collectively enhancing detection performance. Experimental results show that on the XD-Violence and UCF-Crime datasets, the proposed method achieves 85.03% AP and 88.12% AUC, respectively, thoroughly validating its effectiveness in anomaly detection tasks.
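The abstract only sketches the mechanics, so below is a minimal PyTorch sketch of one plausible reading of the two-branch design: an SFSM-style random-sampling plus top-k patch selector produces frame descriptors, a static-prompt branch scores per-frame features, and a dynamic-prompt branch scores inter-frame differences. Every class name, dimension, the mean-pooled stand-in for the CLIP text encoder, and the additive fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialFeatureSelection(nn.Module):
    """SFSM-style selector: random sampling then top-k selection over patch tokens."""

    def __init__(self, feat_dim=512, sample_ratio=0.5, k=8):
        super().__init__()
        self.sample_ratio = sample_ratio
        self.k = k
        self.score = nn.Linear(feat_dim, 1)  # per-patch relevance score (assumed head)

    def forward(self, patch_feats):
        # patch_feats: (T, N, D) spatial patch tokens for T frames with N patches each.
        t, n, d = patch_feats.shape
        m = max(self.k, int(n * self.sample_ratio))
        idx = torch.randperm(n)[:m]                        # random spatial sampling
        sampled = patch_feats[:, idx]                      # (T, m, D)
        scores = self.score(sampled).squeeze(-1)           # (T, m)
        topk = scores.topk(min(self.k, m), dim=-1).indices # keep the k most relevant patches
        selected = torch.gather(
            sampled, 1, topk.unsqueeze(-1).expand(-1, -1, d)
        )                                                  # (T, k, D)
        return selected.mean(dim=1)                        # (T, D) frame-level descriptor


class DualPromptAnomalyScorer(nn.Module):
    """Static prompts score per-frame features; dynamic prompts score frame differences."""

    def __init__(self, feat_dim=512, num_classes=2, prompt_len=8):
        super().__init__()
        self.sfsm = SpatialFeatureSelection(feat_dim)
        self.static_prompts = nn.Parameter(0.02 * torch.randn(num_classes, prompt_len, feat_dim))
        self.dynamic_prompts = nn.Parameter(0.02 * torch.randn(num_classes, prompt_len, feat_dim))
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07))

    @staticmethod
    def _class_embed(prompts):
        # Mean pooling stands in for a real CLIP text encoder over the prompt tokens.
        return F.normalize(prompts.mean(dim=1), dim=-1)    # (num_classes, D)

    def forward(self, patch_feats):
        frame_feats = self.sfsm(patch_feats)               # (T, D)
        diff_feats = frame_feats[1:] - frame_feats[:-1]    # inter-frame difference features
        diff_feats = F.pad(diff_feats, (0, 0, 1, 0))       # pad one zero row to keep length T
        static_sim = F.normalize(frame_feats, dim=-1) @ self._class_embed(self.static_prompts).T
        dynamic_sim = F.normalize(diff_feats, dim=-1) @ self._class_embed(self.dynamic_prompts).T
        return self.logit_scale * (static_sim + dynamic_sim)  # (T, num_classes) logits


if __name__ == "__main__":
    model = DualPromptAnomalyScorer()
    tokens = torch.randn(16, 49, 512)   # 16 frames, 7x7 patch grid of mock CLIP features
    print(model(tokens).shape)          # torch.Size([16, 2])
```

In the paper the prompts would be encoded by the CLIP text encoder and trained under weak (video-level) supervision; the mean pooling above is only there to keep the sketch self-contained.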
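For context on the reported numbers, here is a brief sketch of how frame-level AP and AUC are conventionally computed on WSVAD benchmarks (AP on XD-Violence, AUC on UCF-Crime) with scikit-learn; the label and score arrays are synthetic placeholders, not the paper's outputs.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

labels = np.random.randint(0, 2, size=1000)   # per-frame ground truth (0 = normal, 1 = anomalous)
scores = np.random.rand(1000)                 # per-frame anomaly scores from a model

ap = average_precision_score(labels, scores)  # AP, the metric reported on XD-Violence
auc = roc_auc_score(labels, scores)           # AUC, the metric reported on UCF-Crime
print(f"AP: {ap:.4f}  AUC: {auc:.4f}")
```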
Source journal: Digital Signal Processing (Engineering & Technology – Engineering: Electrical & Electronic)
CiteScore: 5.30
Self-citation rate: 17.20%
Articles per year: 435
Review time: 66 days
Journal description: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy