Multimodal Attention-Enhanced Feature Fusion-Based Weakly Supervised Anomaly Violence Detection

Jungpil Shin; Abu Saleh Musa Miah; Yuta Kaneko; Najmul Hassan; Hyoun-Sup Lee; Si-Woong Jang

IEEE Open Journal of the Computer Society, vol. 6, pp. 129-140, published 2024-12-13. DOI: 10.1109/OJCS.2024.3517154. https://ieeexplore.ieee.org/document/10798463/
Abstract
Weakly supervised video anomaly detection (WS-VAD) plays a pivotal role in advancing intelligent surveillance systems within the field of computer vision. Despite significant research, WS-VAD continues to face challenges, particularly with unimodal approaches that struggle to extract meaningful features effectively. Only a few studies have explored multimodal fusion-based WS-VAD systems, and their accuracy remains unsatisfactory. In response, we propose a novel WS-VAD system that leverages multimodal datasets with an attention-enhanced feature fusion approach to address these challenges. Our system integrates three distinct data modalities (RGB video, optical flow, and audio signals), where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the RGB video stream, we employ a multi-stage, attention-driven feature enhancement process to refine spatial and temporal features. This process begins with a ViT-based CLIP module, whose top-k features are concatenated with I3D- and TCA-based spatiotemporal features. Temporal dependencies are then captured through uncertainty-regulated dual memory units (UR-DMUs), allowing normal and anomalous patterns to be learned simultaneously. The final stage selects the most relevant features, yielding a refined representation of the RGB data. The second stream extracts enhanced spatiotemporal features from optical-flow data using a deep learning and attention module. Lastly, the audio stream detects anomalies in sound patterns through an attention module integrated with the VGGish model, capturing auditory cues. The fusion of these three streams captures motion and audio signals that visual analysis alone often misses, significantly enhancing anomaly detection accuracy and robustness.
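To make the fusion step concrete, the following is a minimal NumPy sketch of attention-weighted fusion of three per-clip feature streams (RGB, optical flow, audio). It is an illustration only, not the paper's implementation: the function `attention_fuse`, the projection matrices `w_q`/`w_k`, and the feature dimensions are all hypothetical, and the paper's actual streams use CLIP/I3D/TCA, UR-DMU, and VGGish backbones rather than random features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(streams, w_q, w_k):
    """Fuse per-modality clip features with attention weights.

    streams: list of (T, D) arrays, one per modality (e.g. RGB, flow, audio),
             already projected to a shared feature dimension D.
    w_q, w_k: (D, d) query/key projections (hypothetical learned parameters).
    Returns a (T, D) fused representation: each clip is a convex combination
    of its modality features, weighted by attention scores.
    """
    stacked = np.stack(streams, axis=1)                 # (T, M, D)
    q = stacked @ w_q                                   # (T, M, d)
    k = stacked @ w_k                                   # (T, M, d)
    scores = (q * k).sum(-1) / np.sqrt(w_q.shape[1])    # (T, M) scaled scores
    weights = softmax(scores, axis=1)                   # per-clip modality weights
    return (weights[..., None] * stacked).sum(axis=1)   # (T, D)

# Illustrative shapes: 32 clips, 512-dim features, 64-dim attention space.
rng = np.random.default_rng(0)
T, D, d = 32, 512, 64
rgb, flow, audio = (rng.standard_normal((T, D)) for _ in range(3))
w_q = rng.standard_normal((D, d)) * 0.01
w_k = rng.standard_normal((D, d)) * 0.01
fused = attention_fuse([rgb, flow, audio], w_q, w_k)
print(fused.shape)  # (32, 512)
```

The point of the design is that the attention weights let the model emphasize whichever modality carries the anomaly cue for a given clip (e.g. audio for gunshots, flow for sudden motion) instead of averaging the streams uniformly.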
Our multimodal fusion achieves an average precision (AP) of 88.28% on the XD-Violence dataset, outperforming prior models by nearly 2%, and attains AUCs of 98.71% on the ShanghaiTech dataset and 90.26% on the UCF-Crime dataset. These results underscore the effectiveness of our approach, which consistently surpasses existing methods across three benchmark datasets and validates its robustness in WS-VAD applications.