{"title":"MMVAD: A vision–language model for cross-domain video anomaly detection with contrastive learning and scale-adaptive frame segmentation","authors":"Debojyoti Biswas, Jelena Tesic","doi":"10.1016/j.eswa.2025.127857","DOIUrl":null,"url":null,"abstract":"<div><div>Video Anomaly Detection (VAD) is crucial for public safety and detecting abnormalities in risk-prone zones. However, detecting anomalies from weakly labeled datasets has been very challenging for CCTV surveillance videos. The challenge is more intense when we involve high-altitude drone videos for VAD tasks. Very few works have been done on drone-captured VAD, and even the existing CCTV VAD methods suffer from several limitations that hinder their optimal performance. Previous VAD works mostly used single modal data, <em>e.g.</em>, video data, which was insufficient to understand the context of complex scenes. Moreover, the existing multimodal systems use the traditional linear fusion method to capture multimodal feature interaction, which does not address the misalignment issue from different modalities. Next, the existing work relies on fixed-scale video segmentation, which fails to preserve the fine-grained local and global context knowledge. Also, it was found that the feature magnitude-based VAD does not correctly represent the anomalous events. To address these issues, we present a novel vision–language-based video anomaly detection for drone videos. We use adaptive long-short-term video segmentation (ALSVS) for local–global knowledge extraction. Next, we propose to use a shallow yet efficient attention-based feature fusion (AFF) technique for multimodal VAD (MMVAD) tasks. Finally, for the first time, we introduce feature anomaly learning based on a saliency-aware contrastive algorithm. We found contrastive anomaly feature learning is more robust than the magnitude-based loss calculation. We performed experiments on two of the latest drone VAD datasets (Drone-Anomaly and UIT Drone), as well as two CCTV VAD datasets (UCF crime and XD-Violence). Compared to the baseline and closest SOTA, we achieved at least a +3.8% and +3.3% increase in AUC, respectively, for the drone and CCTV datasets.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"285 ","pages":"Article 127857"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425014794","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Video Anomaly Detection (VAD) is crucial for public safety and for detecting abnormalities in risk-prone zones. However, detecting anomalies from weakly labeled datasets remains very challenging for CCTV surveillance videos, and the challenge intensifies when high-altitude drone videos are involved. Very little work has addressed drone-captured VAD, and even existing CCTV VAD methods suffer from several limitations that hinder their performance. Previous VAD work mostly used single-modality data, e.g., video alone, which is insufficient for understanding the context of complex scenes. Moreover, existing multimodal systems use traditional linear fusion to capture multimodal feature interaction, which does not address the misalignment between different modalities. Existing work also relies on fixed-scale video segmentation, which fails to preserve fine-grained local and global context, and feature magnitude-based VAD does not faithfully represent anomalous events. To address these issues, we present a novel vision–language-based video anomaly detection framework for drone videos. We use adaptive long-short-term video segmentation (ALSVS) for local–global knowledge extraction. Next, we propose a shallow yet efficient attention-based feature fusion (AFF) technique for multimodal VAD (MMVAD). Finally, for the first time, we introduce feature anomaly learning based on a saliency-aware contrastive algorithm; we find contrastive anomaly feature learning to be more robust than magnitude-based loss calculation. We performed experiments on two of the latest drone VAD datasets (Drone-Anomaly and UIT Drone), as well as two CCTV VAD datasets (UCF-Crime and XD-Violence). Compared to the baseline and the closest SOTA, we achieve AUC gains of at least +3.8% on the drone datasets and +3.3% on the CCTV datasets.
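The abstract does not detail ALSVS, but the core idea of long-short-term segmentation can be illustrated with a toy sketch: the same video is windowed at two temporal scales so that downstream features capture both fine-grained local and global context. The window lengths and function name below are illustrative assumptions, not the authors' implementation (which adapts the scales to content).

```python
def long_short_segments(num_frames, short_len=16, long_len=64):
    # Toy illustration of long-short-term segmentation (hypothetical helper):
    # the same video is sliced at two temporal scales, giving downstream
    # models both short windows (local detail) and long windows (global
    # context). Fixed window sizes here stand in for the adaptive scales
    # the paper's ALSVS would choose.
    short = [(s, min(s + short_len, num_frames)) for s in range(0, num_frames, short_len)]
    long = [(s, min(s + long_len, num_frames)) for s in range(0, num_frames, long_len)]
    return short, long

# Example: a 100-frame video yields 7 short and 2 long segments.
print(long_short_segments(100))
```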
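For the attention-based feature fusion, a minimal PyTorch sketch of the general idea follows: video clip features attend to text-token features via cross-attention, replacing plain linear fusion and letting the model learn cross-modal alignment. The module name, dimensions, and residual design are assumptions for illustration, not the paper's exact AFF module.

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    # Hypothetical sketch of attention-based multimodal fusion: video
    # features are queries, text features are keys/values, so each video
    # segment is re-expressed in terms of the text tokens it aligns with.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) segment-level visual features
        # text_feats:  (B, L, D) token-level text embeddings
        attended, _ = self.cross_attn(video_feats, text_feats, text_feats)
        fused = self.norm(video_feats + attended)  # residual fusion
        return self.proj(fused)

# Usage with CLIP-sized 512-dim features (sizes are assumptions):
aff = AttentionFeatureFusion()
out = aff(torch.randn(2, 32, 512), torch.randn(2, 77, 512))  # -> (2, 32, 512)
```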
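Likewise, the saliency-aware contrastive objective is not specified in the abstract; the sketch below is a generic InfoNCE-style stand-in that contrasts normal segment features against anomalous ones, rather than ranking feature magnitudes. The temperature value and the mean-prototype construction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_anomaly_loss(normal_feats, anomalous_feats, temperature=0.1):
    # Illustrative contrastive objective (not the paper's exact loss):
    # pull normal segment features toward a normal prototype and push
    # anomalous segment features away from it.
    normal = F.normalize(normal_feats, dim=-1)        # (N, D)
    anomalous = F.normalize(anomalous_feats, dim=-1)  # (M, D)
    anchor = F.normalize(normal.mean(dim=0, keepdim=True), dim=-1)  # (1, D) prototype
    pos = torch.exp(anchor @ normal.t() / temperature).sum()
    neg = torch.exp(anchor @ anomalous.t() / temperature).sum()
    return -torch.log(pos / (pos + neg))

# Example with random 512-dim segment features:
loss = contrastive_anomaly_loss(torch.randn(8, 512), torch.randn(8, 512))
```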
Journal Introduction:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.