Yishuo Liu, Chuanxu Wang, Qingyang Yang, Lanxiao Li, Binghui Wang
Self-supervised learning video anomaly detection based on time interval prediction and noise classification
Pattern Recognition, volume 171, Article 112198. Published 2025-07-24. Journal Article.
DOI: 10.1016/j.patcog.2025.112198
https://www.sciencedirect.com/science/article/pii/S0031320325008593
Impact factor: 7.6 (JCR Q1, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Video Anomaly Detection (VAD) aims to automatically identify events in videos that deviate significantly from normal behavioral patterns. Self-supervised learning lets models learn effective features from unlabeled data through purpose-designed proxy tasks. However, existing approaches often rely on coarse-grained modeling, focusing mainly on global sequence order or holistic scene structure, which may limit their ability to capture subtle motion changes or localized anomalies. This paper therefore proposes a self-supervised learning framework built on fine-grained spatio-temporal proxy tasks to extract key features more accurately. For the temporal branch, we design a time interval prediction task: given a fixed middle frame and frames randomly sampled from both sides of it, the model predicts each sampled frame's temporal interval relative to the middle frame, thereby modeling the dynamic patterns of behavior. To strengthen temporal modeling, we introduce a multi-head self-attention mechanism that captures inter-frame dependencies in the input sequence. The spatial branch employs a noise classification task inspired by diffusion models: varying levels of noise are added to image patches, and the model predicts the corresponding noise levels. This encourages the model to learn local appearance features and patch-level sensitivity to perturbations. Our method is trained end to end and does not rely on pre-trained models. Experiments on three benchmark datasets demonstrate stable performance: the method achieves AUC scores of 98.6% on UCSD Ped2, 91.7% on CUHK Avenue, and 83.7% on ShanghaiTech. These results suggest that the proposed approach can generalize well across different scenes, perspectives, and types of anomalous behavior.
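To make the two proxy tasks in the abstract concrete, the sketch below shows one way their training samples and labels could be constructed. It is an illustration only, not the authors' implementation: the function names `sample_interval_task` and `make_noise_task`, the use of signed frame offsets as interval labels, Gaussian noise, a small discrete set of noise levels, and an independent level per patch are all assumptions, since the abstract does not specify these details. The network itself, including the multi-head self-attention, is omitted.

```python
import random


def sample_interval_task(num_frames, center, num_side, rng):
    """Build one sample for a time-interval prediction task.

    Picks `num_side` random frames on each side of a fixed center frame.
    Returns (indices, intervals): frame indices in temporal order, and the
    signed offset of each frame from the center (an assumed label format).
    """
    left = sorted(rng.sample(range(0, center), num_side))
    right = sorted(rng.sample(range(center + 1, num_frames), num_side))
    indices = left + [center] + right
    intervals = [i - center for i in indices]
    return indices, intervals


def make_noise_task(image, patch_size, noise_sigmas, rng):
    """Build one sample for a patch-level noise classification task.

    `image` is a 2D list of floats whose height and width are multiples of
    `patch_size`. Each patch receives Gaussian noise with a standard
    deviation drawn at random from `noise_sigmas` (assumed noise model).
    Returns (noisy_image, labels), where labels[r][c] is the index of the
    noise level applied to the patch in patch-row r, patch-column c.
    """
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    labels = []
    for top in range(0, height, patch_size):
        row_labels = []
        for left in range(0, width, patch_size):
            level = rng.randrange(len(noise_sigmas))
            sigma = noise_sigmas[level]
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    noisy[y][x] = image[y][x] + rng.gauss(0.0, sigma)
            row_labels.append(level)
        labels.append(row_labels)
    return noisy, labels


# Example usage with hypothetical sizes: a 21-frame clip and an 8x8 image.
rng = random.Random(0)
frame_ids, interval_targets = sample_interval_task(21, 10, 3, rng)
flat_image = [[0.5] * 8 for _ in range(8)]
noisy_image, level_targets = make_noise_task(flat_image, 4, [0.0, 0.1, 0.3], rng)
print(frame_ids, interval_targets)
print(level_targets)
```

In both cases the labels come for free from how the sample was built, which is what lets such tasks train on unlabeled video. At test time, a sample on which the model predicts intervals or noise levels poorly would presumably be scored as more anomalous, though the abstract does not describe the scoring rule.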
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.