Injecting Text Clues for Improving Anomalous Event Detection From Weakly Labeled Videos

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2024-10-15 DOI:10.1109/TIP.2024.3477351

Tianshan Liu;Kin-Man Lam;Bing-Kun Bao

{"title":"Injecting Text Clues for Improving Anomalous Event Detection From Weakly Labeled Videos","authors":"Tianshan Liu;Kin-Man Lam;Bing-Kun Bao","doi":"10.1109/TIP.2024.3477351","DOIUrl":null,"url":null,"abstract":"Video anomaly detection (VAD) aims at localizing the snippets containing anomalous events in long unconstrained videos. The weakly supervised (WS) setting, where solely video-level labels are available during training, has attracted considerable attention, owing to its satisfactory trade-off between the detection performance and annotation cost. However, due to lack of snippet-level dense labels, the existing WS-VAD methods still get easily stuck on the detection errors, caused by false alarms and incomplete localization. To address this dilemma, in this paper, we propose to inject text clues of anomaly-event categories for improving WS-VAD, via a dedicated dual-branch framework. For suppressing the response of confusing normal contexts, we first present a text-guided anomaly discovering (TAG) branch based on a hierarchical matching scheme, which utilizes the label-text queries to search the discriminative anomalous snippets in a global-to-local fashion. To facilitate the completeness of anomaly-instance localization, an anomaly-conditioned text completion (ATC) branch is further designed to perform an auxiliary generative task, which intrinsically forces the model to gather sufficient event semantics from all the relevant anomalous snippets for completely reconstructing the masked description sentence. Furthermore, to encourage the cross-branch knowledge sharing, a mutual learning strategy is introduced by imposing a consistency constraint on the anomaly scores of these two branches. Extensive experimental results on two public benchmarks validate that the proposed method achieves superior performance over the competing methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5907-5920"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10719608/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Video anomaly detection (VAD) aims at localizing the snippets containing anomalous events in long unconstrained videos. The weakly supervised (WS) setting, where solely video-level labels are available during training, has attracted considerable attention, owing to its satisfactory trade-off between the detection performance and annotation cost. However, due to lack of snippet-level dense labels, the existing WS-VAD methods still get easily stuck on the detection errors, caused by false alarms and incomplete localization. To address this dilemma, in this paper, we propose to inject text clues of anomaly-event categories for improving WS-VAD, via a dedicated dual-branch framework. For suppressing the response of confusing normal contexts, we first present a text-guided anomaly discovering (TAG) branch based on a hierarchical matching scheme, which utilizes the label-text queries to search the discriminative anomalous snippets in a global-to-local fashion. To facilitate the completeness of anomaly-instance localization, an anomaly-conditioned text completion (ATC) branch is further designed to perform an auxiliary generative task, which intrinsically forces the model to gather sufficient event semantics from all the relevant anomalous snippets for completely reconstructing the masked description sentence. Furthermore, to encourage the cross-branch knowledge sharing, a mutual learning strategy is introduced by imposing a consistency constraint on the anomaly scores of these two branches. Extensive experimental results on two public benchmarks validate that the proposed method achieves superior performance over the competing methods.

查看原文本刊更多论文

注入文本线索，改进从弱标签视频中检测异常事件的能力

视频异常检测（VAD）旨在定位长视频中包含异常事件的片段。弱监督（WS）设置在训练过程中仅提供视频级标签，由于其在检测性能和注释成本之间令人满意的权衡，已经引起了广泛关注。然而，由于缺乏片段级的密集标签，现有的 WS-VAD 方法仍然很容易陷入误报和定位不完整造成的检测错误。针对这一困境，本文提出通过专门的双分支框架，注入异常事件类别的文本线索，以改进 WS-VAD。为了抑制正常上下文混淆的反应，我们首先提出了基于分层匹配方案的文本引导异常发现（TAG）分支，该分支利用标签文本查询，以全局到局部的方式搜索具有区分性的异常片段。为了促进异常实例定位的完整性，还进一步设计了异常条件文本补全（ATC）分支来执行辅助生成任务，从本质上迫使模型从所有相关异常片段中收集足够的事件语义，以完全重建屏蔽描述句子。此外，为了鼓励跨分支知识共享，我们还引入了相互学习策略，对这两个分支的异常得分施加一致性约束。在两个公共基准上的广泛实验结果验证了所提出的方法比其他竞争方法性能更优越。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

自引率

0.00%

发文量