Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection

Yujiang Pu, Xiaoyu Wu, Lulu Yang, Shengjin Wang
{"title":"Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection","authors":"Yujiang Pu;Xiaoyu Wu;Lulu Yang;Shengjin Wang","doi":"10.1109/TIP.2024.3451935","DOIUrl":null,"url":null,"abstract":"Weakly supervised video anomaly detection aims to locate abnormal activities in untrimmed videos without the need for frame-level supervision. Prior work has utilized graph convolution networks or self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features. However, these approaches are limited in two aspects: 1) Multi-branch parallel architectures, while capturing multi-scale temporal dependencies, inevitably lead to increased parameter and computational costs. 2) The binarized MIL constraint only ensures the interclass separability while neglecting the fine-grained discriminability within anomalous classes. To this end, we introduce a novel WS-VAD framework that focuses on efficient temporal modeling and anomaly innerclass discriminability. We first construct a Temporal Context Aggregation (TCA) module that simultaneously captures local-global dependencies by reusing an attention matrix along with adaptive context fusion. In addition, we propose a Prompt-Enhanced Learning (PEL) module that incorporates semantic priors using knowledge-based prompts to boost the discrimination of visual features while ensuring separability across anomaly subclasses. The proposed components have been validated through extensive experiments, which demonstrate superior performance on three challenging datasets, UCF-Crime, XD-Violence and ShanghaiTech, with fewer parameters and reduced computational effort. Notably, our method can significantly improve the detection accuracy for certain anomaly subclasses and reduced the false alarm rate. Our code is available at: \n<uri>https://github.com/yujiangpu20/PEL4VAD</uri>\n.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"4923-4936"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10667004/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Weakly supervised video anomaly detection aims to locate abnormal activities in untrimmed videos without the need for frame-level supervision. Prior work has utilized graph convolutional networks or self-attention mechanisms alongside a multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features. However, these approaches are limited in two respects: 1) multi-branch parallel architectures, while capturing multi-scale temporal dependencies, inevitably increase parameter and computational costs; 2) the binarized MIL constraint only ensures inter-class separability while neglecting fine-grained discriminability within anomalous classes. To this end, we introduce a novel WS-VAD framework that focuses on efficient temporal modeling and intra-class discriminability among anomalies. We first construct a Temporal Context Aggregation (TCA) module that captures local and global dependencies simultaneously by reusing a single attention matrix together with adaptive context fusion. In addition, we propose a Prompt-Enhanced Learning (PEL) module that incorporates semantic priors through knowledge-based prompts to boost the discriminative power of visual features while ensuring separability across anomaly subclasses. Extensive experiments validate the proposed components, demonstrating superior performance on three challenging datasets (UCF-Crime, XD-Violence, and ShanghaiTech) with fewer parameters and reduced computational effort. Notably, our method significantly improves detection accuracy for certain anomaly subclasses and reduces the false alarm rate. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.
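For context on the MIL-based classification loss the abstract refers to, below is a minimal PyTorch sketch of a common top-k MIL objective used in weakly supervised anomaly detection. The function name, the choice of k, and the binary cross-entropy form are illustrative assumptions, not necessarily the paper's exact loss; see the authors' repository for the real implementation.

```python
# A minimal sketch of a top-k multiple-instance-learning (MIL) loss:
# video-level BCE on the mean of the k highest snippet anomaly scores.
import torch
import torch.nn.functional as F

def mil_topk_loss(scores: torch.Tensor, labels: torch.Tensor, k: int = 3) -> torch.Tensor:
    """scores: (B, T) per-snippet anomaly probabilities in [0, 1].
    labels: (B,) video-level labels, 1 = anomalous, 0 = normal."""
    topk = scores.topk(k, dim=1).values   # (B, k) highest snippet scores
    video_scores = topk.mean(dim=1)       # (B,) aggregated video-level score
    return F.binary_cross_entropy(video_scores, labels.float())

# Usage: random snippet scores for a batch of 4 videos, 32 snippets each.
scores = torch.rand(4, 32)
labels = torch.tensor([1, 0, 1, 0])
loss = mil_topk_loss(scores, labels)
```

Because only the video label is known, the top-k pooling pushes the most anomalous-looking snippets of a positive video toward 1 and all snippets of a normal video toward 0, which is what makes frame-level localization possible without frame-level labels.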
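The abstract's key efficiency claim is that one attention matrix can be reused for both local and global temporal context. The sketch below illustrates that idea: a single similarity matrix is computed once, read out globally and under a band mask, and the two context features are fused with a learned weight. Module structure, the band-mask realization of "local", and the scalar fusion gate are all hypothetical simplifications of the paper's TCA module.

```python
# A minimal sketch of attention-matrix reuse for local + global temporal context.
import torch
import torch.nn as nn

class TemporalContextAggregation(nn.Module):
    def __init__(self, dim: int, window: int = 5):
        super().__init__()
        self.qk = nn.Linear(dim, dim)                  # shared query/key projection
        self.window = window                           # half-width of the local band
        self.alpha = nn.Parameter(torch.tensor(0.5))   # adaptive fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) snippet features.
        B, T, D = x.shape
        h = self.qk(x)
        attn = h @ h.transpose(1, 2) / D ** 0.5        # (B, T, T), computed once
        # Global read-out of the shared attention matrix.
        global_ctx = attn.softmax(dim=-1) @ x
        # Local read-out: the same matrix, masked to a +/- window neighborhood.
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window   # (T, T)
        local_attn = attn.masked_fill(~band, float('-inf')).softmax(dim=-1)
        local_ctx = local_attn @ x
        # Adaptive fusion of local and global context, residual to the input.
        a = torch.sigmoid(self.alpha)
        return x + a * local_ctx + (1 - a) * global_ctx

# Usage: a batch of 2 videos, 32 snippets of 512-d features each.
tca = TemporalContextAggregation(dim=512)
out = tca(torch.randn(2, 32, 512))   # (2, 32, 512)
```

The point of the reuse is cost: a multi-branch design would compute one attention map per temporal scale, whereas here the local branch is just a masked view of the global map, so the extra scale adds no new matrix multiplication against the features' quadratic attention term.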
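Similarly, the prompt-enhanced learning idea can be illustrated with a small alignment loss: class names are expanded into knowledge-based text prompts, encoded once with a frozen text encoder (e.g., CLIP's), and visual snippet features are pulled toward the embedding of their class so that anomaly subclasses remain separable. The prompt wording, cross-entropy alignment loss, and temperature below are assumptions for illustration, not the paper's exact PEL design; random tensors stand in for real text embeddings.

```python
# A minimal sketch of prompt-based visual-text alignment across anomaly subclasses.
import torch
import torch.nn.functional as F

def prompt_alignment_loss(visual: torch.Tensor,
                          text: torch.Tensor,
                          labels: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Cross-entropy over cosine similarities to per-class prompt embeddings.

    visual: (N, D) projected snippet features.
    text:   (C, D) frozen prompt embeddings, one per class (incl. 'normal').
    labels: (N,)   class index of each snippet.
    """
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.T / tau                  # (N, C) similarity logits
    return F.cross_entropy(logits, labels)

# Usage with random stand-ins for prompt embeddings (D = 512, C = 4 classes),
# e.g. prompts like "a video of a road accident" encoded by a frozen CLIP model.
visual = torch.randn(16, 512)
text = torch.randn(4, 512)
labels = torch.randint(0, 4, (16,))
loss = prompt_alignment_loss(visual, text, labels)
```

Unlike the binary MIL constraint, this objective discriminates between anomaly subclasses (fighting vs. road accident vs. normal, say), which is how semantic priors can sharpen fine-grained discriminability rather than only the anomalous/normal boundary.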