Self-supervised learning video anomaly detection based on time interval prediction and noise classification

IF 7.6 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Pattern Recognition · Pub Date: 2025-07-24 · DOI: 10.1016/j.patcog.2025.112198

Yishuo Liu, Chuanxu Wang, Qingyang Yang, Lanxiao Li, Binghui Wang

Pattern Recognition, Volume 171, Article 112198. Journal Article. https://www.sciencedirect.com/science/article/pii/S0031320325008593

Citations: 0

Abstract

Video Anomaly Detection (VAD) aims to automatically identify events in videos that deviate significantly from normal behavioral patterns. Self-supervised learning lets models learn effective features from unlabeled data through purpose-built proxy tasks. However, existing approaches often rely on coarse-grained modeling, focusing mainly on global sequence order or holistic scene structure, which may limit their ability to capture subtle motion changes or localized anomalies. This paper therefore proposes a self-supervised learning framework with fine-grained spatio-temporal proxy tasks to extract key features more accurately. For the temporal branch, we design a time interval prediction task: given a fixed middle frame and frames randomly sampled from both sides of it, the model predicts each sampled frame's temporal interval relative to the middle frame, thereby modeling the dynamic patterns of behavior. To strengthen temporal modeling, we introduce a multi-head self-attention mechanism that captures inter-frame dependencies in the input sequence. The spatial branch uses a noise classification task inspired by diffusion models: varying levels of noise are added to image patches, and the model predicts the corresponding noise level. This encourages the model to learn local appearance features and patch-level sensitivity to perturbations. Our method is trained end to end and does not rely on pre-trained models. Experiments on three benchmark datasets show stable performance: the method achieves AUC scores of 98.6% on UCSD Ped2, 91.7% on CUHK Avenue, and 83.7% on ShanghaiTech. These results suggest that the proposed approach generalizes well across different scenes, perspectives, and types of anomalous behavior.
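As a rough illustration of how training samples for the two proxy tasks described in the abstract might be constructed, the sketch below builds (input, label) pairs in plain Python. It is not the authors' implementation. The function names, the signed-offset label encoding, the `max_offset` window, the Gaussian noise model, the `sigmas` noise schedule, and the clipping to [0, 1] are all assumptions made for illustration; the abstract specifies none of these details.

```python
import random


def sample_interval_task(num_frames, center, num_side, max_offset, rng):
    """Sample frames on both sides of a fixed center frame.

    Returns (indices, labels). `indices` are sorted frame positions that
    include the center; `labels` are signed offsets relative to the center
    (an assumed encoding of the "temporal interval").
    Raises ValueError if a side has fewer than `num_side` candidate frames.
    """
    left_pool = range(max(0, center - max_offset), center)
    right_pool = range(center + 1, min(num_frames, center + max_offset + 1))
    left = sorted(rng.sample(left_pool, num_side))
    right = sorted(rng.sample(right_pool, num_side))
    indices = left + [center] + right
    labels = [i - center for i in indices]
    return indices, labels


def make_noise_sample(patch, sigmas, rng):
    """Add Gaussian noise at a randomly chosen level to a 2D patch.

    `patch` is a list of rows with pixel values in [0, 1]. Returns
    (noisy_patch, level), where `level` indexes into `sigmas` and serves
    as the classification target.
    """
    level = rng.randrange(len(sigmas))
    sigma = sigmas[level]
    noisy = [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in patch
    ]
    return noisy, level


# Hypothetical usage: a 100-frame clip and a flat 4x4 grayscale patch.
rng = random.Random(0)
indices, labels = sample_interval_task(
    num_frames=100, center=50, num_side=2, max_offset=8, rng=rng
)
patch = [[0.5] * 4 for _ in range(4)]
noisy, level = make_noise_sample(patch, sigmas=[0.0, 0.05, 0.1, 0.2], rng=rng)
```

In a setup like this, a temporal model would receive the frames at `indices` and be trained to predict `labels`, while a spatial model would receive `noisy` and be trained to predict `level`. How the paper turns these predictions into anomaly scores is not stated in the abstract.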




Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.