Multimodal Attention-Enhanced Feature Fusion-Based Weakly Supervised Anomaly Violence Detection

IEEE Open Journal of the Computer Society. Pub Date: 2024-12-13. DOI: 10.1109/OJCS.2024.3517154

Jungpil Shin; Abu Saleh Musa Miah; Yuta Kaneko; Najmul Hassan; Hyoun-Sup Lee; Si-Woong Jang

Volume 6, pages 129-140. Full text: https://ieeexplore.ieee.org/document/10798463/

Abstract

Weakly supervised video anomaly detection (WS-VAD) is central to intelligent surveillance systems in computer vision. Despite significant research, WS-VAD remains challenging, particularly for unimodal approaches, which struggle to extract meaningful features. Only a few studies have addressed WS-VAD based on multimodal dataset fusion, and their accuracy is unsatisfactory. In response, we propose a novel WS-VAD system that applies attention-enhanced feature fusion to multimodal data. Our system integrates three data modalities (RGB video, optical flow, and audio signals); each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the RGB video stream, we employ a multi-stage, attention-driven feature enhancement process to refine spatial and temporal features. This process begins with a ViT-based CLIP module, whose top-k features are concatenated with I3D- and TCA-based spatiotemporal features. Temporal dependencies are then captured through uncertainty-regulated dual memory units (UR-DMUs), which allow normal and anomalous patterns to be learned simultaneously. The final stage selects the most relevant features, yielding a refined representation of the RGB data. The second stream extracts enhanced spatiotemporal features from optical flow data using a deep learning and attention module. Lastly, the audio stream detects anomalies in sound patterns through an attention module integrated with the VGGish model, capturing auditory cues. Fusing these three streams captures motion and audio signals that visual analysis alone often misses, improving anomaly detection accuracy and robustness.

Our multimodal fusion achieves an average precision (AP) of 88.28% on the XD-Violence dataset, outperforming prior models by nearly 2%, and attains AUCs of 98.71% on the ShanghaiTech dataset and 90.26% on the UCF-Crime dataset. These results show that our approach consistently surpasses existing methods across three benchmark datasets and confirm its robustness in WS-VAD applications.
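
The abstract does not give the fusion equations, so the following is only a minimal NumPy sketch of what top-k feature selection followed by attention-weighted fusion of three streams could look like. Everything specific here is an assumption for illustration: the function names (`top_k_snippets`, `attention_fuse`), the L2-norm selection criterion, the shared feature dimension, the random stand-in features, and the single query vector are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def top_k_snippets(features, k):
    """Keep the k snippets with the largest L2 norm, in temporal order.

    The norm criterion is an assumed stand-in; the abstract only says
    that the top k features are selected.
    """
    norms = np.linalg.norm(features, axis=1)
    keep = np.sort(np.argsort(norms)[-k:])
    return features[keep]


def attention_fuse(streams, query):
    """Fuse per-snippet features from several modalities.

    streams: list of M arrays, each of shape (T, D), assumed to be
             already projected to a common dimension D.
    query:   array of shape (D,), a hypothetical attention query
             (it would be learned in a real model).
    Returns the fused (T, D) features and the (T, M) modality weights.
    """
    stacked = np.stack(streams, axis=1)                    # (T, M, D)
    logits = stacked @ query / np.sqrt(stacked.shape[-1])  # (T, M)
    weights = softmax(logits, axis=1)                      # sums to 1 per snippet
    fused = (weights[..., None] * stacked).sum(axis=1)     # (T, D)
    return fused, weights


# Random stand-ins for RGB, optical-flow, and audio features.
rng = np.random.default_rng(0)
T, D = 8, 16
rgb, flow, audio = (rng.normal(size=(T, D)) for _ in range(3))
query = rng.normal(size=D)

selected_rgb = top_k_snippets(rgb, k=4)
fused, weights = attention_fuse([rgb, flow, audio], query)
```

In this sketch each snippet receives its own weighting over the three modalities, so a snippet with a strong audio cue can lean on the audio stream while a silent one relies on RGB and flow. A trained system would replace the random query and features with learned parameters and real extractor outputs.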

Source journal: IEEE Open Journal of the Computer Society. CiteScore: 12.60. Self-citation rate: 0.00%.