SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection

Comput. Vis. Image Underst. Pub Date : 2022-07-16 DOI:10.48550/arXiv.2207.08003

Antonio Bărbălău, Radu Tudor Ionescu, Mariana-Iuliana Georgescu, J. Dueholm, B. Ramachandra, Kamal Nasrollahi, F. Khan, T. Moeslund, M. Shah


Citations: 19

Abstract



A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g. based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g. objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.
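The abstract names background subtraction as one way to find high-motion regions without relying on a pre-trained object detector, but it does not say how the authors implement it. The following is only a minimal sketch of the general idea, assuming grayscale frames stored as nested lists and a per-pixel temporal median as the background model. The function name `high_motion_box` and the threshold value are hypothetical choices made for illustration, not details taken from the paper.

```python
from statistics import median


def high_motion_box(frames, threshold=25):
    """Bounding box (top, left, bottom, right) of pixels in the last frame
    that differ from the per-pixel temporal median, or None if none do.

    frames: list of T grayscale frames, each a list of H rows of W numbers.
    Illustrative assumption only; not the method used in the paper.
    """
    height, width = len(frames[0]), len(frames[0][0])
    rows, cols = [], []
    for y in range(height):
        for x in range(width):
            # The median over time approximates the static background.
            background = median(frame[y][x] for frame in frames)
            # A large deviation from the background marks a moving pixel.
            if abs(frames[-1][y][x] - background) > threshold:
                rows.append(y)
                cols.append(x)
    if not rows:
        return None
    # Bottom and right bounds are exclusive, like Python slices.
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


# Synthetic clip: five black 32x32 frames, with a bright 4x6 patch
# appearing only in the last one.
clip = [[[0] * 32 for _ in range(32)] for _ in range(5)]
for y in range(10, 14):
    for x in range(20, 26):
        clip[-1][y][x] = 255

print(high_motion_box(clip))
```

Because the box comes from pixel changes rather than class labels, a sketch like this does not depend on the moving object belonging to a known category, which is the limitation of the pre-trained detector that the abstract points to.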

