{"title":"MMVAD: A vision–language model for cross-domain video anomaly detection with contrastive learning and scale-adaptive frame segmentation","authors":"Debojyoti Biswas, Jelena Tesic","doi":"10.1016/j.eswa.2025.127857","DOIUrl":null,"url":null,"abstract":"<div><div>Video Anomaly Detection (VAD) is crucial for public safety and detecting abnormalities in risk-prone zones. However, detecting anomalies from weakly labeled datasets has been very challenging for CCTV surveillance videos. The challenge is more intense when we involve high-altitude drone videos for VAD tasks. Very few works have been done on drone-captured VAD, and even the existing CCTV VAD methods suffer from several limitations that hinder their optimal performance. Previous VAD works mostly used single modal data, <em>e.g.</em>, video data, which was insufficient to understand the context of complex scenes. Moreover, the existing multimodal systems use the traditional linear fusion method to capture multimodal feature interaction, which does not address the misalignment issue from different modalities. Next, the existing work relies on fixed-scale video segmentation, which fails to preserve the fine-grained local and global context knowledge. Also, it was found that the feature magnitude-based VAD does not correctly represent the anomalous events. To address these issues, we present a novel vision–language-based video anomaly detection for drone videos. We use adaptive long-short-term video segmentation (ALSVS) for local–global knowledge extraction. Next, we propose to use a shallow yet efficient attention-based feature fusion (AFF) technique for multimodal VAD (MMVAD) tasks. Finally, for the first time, we introduce feature anomaly learning based on a saliency-aware contrastive algorithm. We found contrastive anomaly feature learning is more robust than the magnitude-based loss calculation. We performed experiments on two of the latest drone VAD datasets (Drone-Anomaly and UIT Drone), as well as two CCTV VAD datasets (UCF crime and XD-Violence). Compared to the baseline and closest SOTA, we achieved at least a +3.8% and +3.3% increase in AUC, respectively, for the drone and CCTV datasets.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"285 ","pages":"Article 127857"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425014794","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Video Anomaly Detection (VAD) is crucial for public safety and for detecting abnormalities in risk-prone zones. However, detecting anomalies from weakly labeled datasets remains very challenging for CCTV surveillance videos, and the challenge intensifies when high-altitude drone videos are involved. Very little work has addressed drone-captured VAD, and even existing CCTV VAD methods suffer from several limitations that hinder their performance. Previous VAD work mostly used single-modality data, e.g., video alone, which is insufficient for understanding the context of complex scenes. Moreover, existing multimodal systems use traditional linear fusion to capture multimodal feature interaction, which does not address the misalignment between different modalities. Existing work also relies on fixed-scale video segmentation, which fails to preserve fine-grained local and global context, and feature magnitude-based VAD does not faithfully represent anomalous events. To address these issues, we present a novel vision–language-based video anomaly detection framework for drone videos. We use adaptive long-short-term video segmentation (ALSVS) for local–global knowledge extraction. Next, we propose a shallow yet efficient attention-based feature fusion (AFF) technique for multimodal VAD (MMVAD). Finally, for the first time, we introduce feature anomaly learning based on a saliency-aware contrastive algorithm; we find contrastive anomaly feature learning to be more robust than magnitude-based loss calculation. We performed experiments on two of the latest drone VAD datasets (Drone-Anomaly and UIT Drone), as well as two CCTV VAD datasets (UCF-Crime and XD-Violence). Compared to the baseline and the closest SOTA, we achieve AUC gains of at least +3.8% on the drone datasets and +3.3% on the CCTV datasets.
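The abstract does not detail ALSVS, but the core idea of long-short-term segmentation can be illustrated with a toy sketch: the same video is windowed at two temporal scales so that downstream features capture both fine-grained local and global context. The window lengths and function name below are illustrative assumptions, not the authors' implementation (which adapts the scales to content).

```python
def long_short_segments(num_frames, short_len=16, long_len=64):
    # Toy illustration of long-short-term segmentation (hypothetical helper):
    # the same video is sliced at two temporal scales, giving downstream
    # models both short windows (local detail) and long windows (global
    # context). Fixed window sizes here stand in for the adaptive scales
    # the paper's ALSVS would choose.
    short = [(s, min(s + short_len, num_frames)) for s in range(0, num_frames, short_len)]
    long = [(s, min(s + long_len, num_frames)) for s in range(0, num_frames, long_len)]
    return short, long

# Example: a 100-frame video yields 7 short and 2 long segments.
print(long_short_segments(100))
```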
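For the attention-based feature fusion, a minimal PyTorch sketch of the general idea follows: video clip features attend to text-token features via cross-attention, replacing plain linear fusion and letting the model learn cross-modal alignment. The module name, dimensions, and residual design are assumptions for illustration, not the paper's exact AFF module.

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    # Hypothetical sketch of attention-based multimodal fusion: video
    # features are queries, text features are keys/values, so each video
    # segment is re-expressed in terms of the text tokens it aligns with.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) segment-level visual features
        # text_feats:  (B, L, D) token-level text embeddings
        attended, _ = self.cross_attn(video_feats, text_feats, text_feats)
        fused = self.norm(video_feats + attended)  # residual fusion
        return self.proj(fused)

# Usage with CLIP-sized 512-dim features (sizes are assumptions):
aff = AttentionFeatureFusion()
out = aff(torch.randn(2, 32, 512), torch.randn(2, 77, 512))  # -> (2, 32, 512)
```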
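Likewise, the saliency-aware contrastive objective is not specified in the abstract; the sketch below is a generic InfoNCE-style stand-in that contrasts normal segment features against anomalous ones, rather than ranking feature magnitudes. The temperature value and the mean-prototype construction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_anomaly_loss(normal_feats, anomalous_feats, temperature=0.1):
    # Illustrative contrastive objective (not the paper's exact loss):
    # pull normal segment features toward a normal prototype and push
    # anomalous segment features away from it.
    normal = F.normalize(normal_feats, dim=-1)        # (N, D)
    anomalous = F.normalize(anomalous_feats, dim=-1)  # (M, D)
    anchor = F.normalize(normal.mean(dim=0, keepdim=True), dim=-1)  # (1, D) prototype
    pos = torch.exp(anchor @ normal.t() / temperature).sum()
    neg = torch.exp(anchor @ anomalous.t() / temperature).sum()
    return -torch.log(pos / (pos + neg))

# Example with random 512-dim segment features:
loss = contrastive_anomaly_loss(torch.randn(8, 512), torch.randn(8, 512))
```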
Journal Introduction:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.