{"title":"Multi-scale spatiotemporal normality learning for unsupervised video anomaly detection","authors":"Caitian Liu, Linxiao Gong, Xiong Chen","doi":"10.1007/s10489-025-06485-3","DOIUrl":null,"url":null,"abstract":"<div><p>Video anomaly detection aims to automatically identify abnormal spatiotemporal patterns in surveillance videos. While unsupervised methods avoid the high cost of collecting abnormal data by learning from regular events, they often struggle to effectively model the inherent multiscale nature of video data. To address this challenge, we propose Multi-Scale Spatiotemporal Normality Learning (MS<span>\\(^2\\)</span>NL), a unified framework that systematically processes and integrates multiscale features across both spatial and temporal dimensions. Our framework employs an attention-enhanced stepwise fusion module to aggregate spatial features at different resolutions, enabling comprehensive modeling of appearance patterns from local textures to global structures. For temporal information processing, we design a dynamic aggregation module based on one-dimensional dilated convolutions that effectively captures motion dependencies across multi-scale feature maps while maintaining computational efficiency. These multiscale features are processed through dual decoders: a temporal decoder that learns motion normality through RGB-to-optical-flow mapping, and a spatial decoder that models appearance normality via future frame prediction, with multiscale prototype features stored in an external memory network. This sophisticated handling of multiscale information enables MS<span>\\(^2\\)</span>NL to capture subtle spatial deviations while maintaining sensitivity to temporal anomalies. 
Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art frame-level AUROCs of 98.3%, 91.5%, and 74.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 7","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10489-025-06485-3.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06485-3","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Video anomaly detection aims to automatically identify abnormal spatiotemporal patterns in surveillance videos. While unsupervised methods avoid the high cost of collecting abnormal data by learning from regular events, they often struggle to model the inherently multi-scale nature of video data. To address this challenge, we propose Multi-Scale Spatiotemporal Normality Learning (MS\(^2\)NL), a unified framework that systematically processes and integrates multi-scale features across both the spatial and temporal dimensions. The framework employs an attention-enhanced stepwise fusion module to aggregate spatial features at different resolutions, enabling comprehensive modeling of appearance patterns from local textures to global structures. For temporal information, we design a dynamic aggregation module based on one-dimensional dilated convolutions that captures motion dependencies across multi-scale feature maps while maintaining computational efficiency. These multi-scale features are processed by dual decoders: a temporal decoder that learns motion normality through RGB-to-optical-flow mapping, and a spatial decoder that models appearance normality via future-frame prediction, with multi-scale prototype features stored in an external memory network. This handling of multi-scale information enables MS\(^2\)NL to capture subtle spatial deviations while remaining sensitive to temporal anomalies. Extensive experiments on benchmark datasets demonstrate the effectiveness of the approach, achieving state-of-the-art frame-level AUROCs of 98.3%, 91.5%, and 74.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
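The temporal module described above aggregates motion cues with one-dimensional dilated convolutions at several dilation rates, widening the receptive field over the frame axis without extra parameters. The paper does not publish its implementation, so the following is only a minimal illustrative sketch of that general technique; the function names (`dilated_conv1d`, `temporal_aggregate`), the toy smoothing kernel, and the dilation rates (1, 2, 4) are all assumptions, not taken from the paper.

```python
# Hedged sketch: multi-dilation 1D temporal aggregation over a per-frame
# feature signal. Illustrative only; names, kernel, and dilation rates
# are assumptions and do not come from the MS^2NL paper.

def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D convolution with the given dilation rate."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # receptive field of one output
    n_out = len(signal) - span + 1
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(k))
        for i in range(n_out)
    ]

def temporal_aggregate(frame_features, dilations=(1, 2, 4)):
    """Average responses from several dilation rates, each zero-padded
    back to the input length, so small and large temporal contexts are
    fused into one score per frame."""
    kernel = [1.0 / 3] * 3                   # toy 3-tap smoothing kernel
    agg = [0.0] * len(frame_features)
    for d in dilations:
        out = dilated_conv1d(frame_features, kernel, d)
        pad = (len(frame_features) - len(out)) // 2
        for i, v in enumerate(out):
            agg[pad + i] += v / len(dilations)
    return agg

# A lone spike in the frame signal spreads its response over neighbors
# at each dilation rate, mimicking multi-scale temporal context.
scores = temporal_aggregate([0, 0, 0, 9, 0, 0, 0, 0, 0])
```

Larger dilation rates see farther across time at the same cost per output, which is why stacking a few rates is a common cheap substitute for long temporal windows.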
Journal description:
Focusing on research in artificial intelligence and neural networks, this journal addresses real-life problems in manufacturing, defense, management, government, and industry that are too complex to be solved by conventional approaches and that require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments that address real, complex problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.