Two-Stage Spatio-Temporal Vision Transformer for the Detection of Violent Scenes
M. Constantin, B. Ionescu
2022 14th International Conference on Communications (COMM), 2022-06-16
DOI: 10.1109/comm54429.2022.9817200
Citations: 0
Abstract
The rapid expansion and adoption of CCTV systems brings a series of problems that, if left unchecked, can undermine the advantages of such systems and reduce their effectiveness in security surveillance scenarios. The potentially vast quantities of data produced by a CCTV system covering a city or its problematic areas, venues, events, industrial sites, or even smaller security perimeters can overwhelm human operators and make it hard to distinguish important security events from the rest of the normal data. Automated systems that provide operators with accurate alarms when certain events take place are therefore of paramount importance, as they can substantially reduce operator workload and improve the efficiency of the system. In this regard, we propose a Two-Stage Vision Transformer-based (2SViT) system for the detection of violent scenes. In this setup, the first stage handles frame-level processing, while the second stage processes temporal information by gathering frame-level features. We train and validate the proposed Transformer architecture on the popular XD-Violence dataset, test several size variations of the architecture, and show good results when compared with baseline scores.
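The two-stage layout described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' 2SViT implementation: the abstract gives no layer counts, dimensions, or training details, so everything below is an assumption made for illustration (single-head attention, random untrained weights, mean pooling, no positional encodings, and the hypothetical names `stage_one`, `stage_two`, and `score_clip`). It shows only the data flow: each frame is reduced to one feature vector, and the sequence of frame features is then processed along the time axis to give one clip-level score.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n_tokens, dim)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v


def init_attention(rng, dim):
    """Random (untrained) query/key/value projections; illustrative only."""
    return [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]


def stage_one(frame, patch, w_embed, attn):
    """Frame-level stage: split one frame into patches, embed, attend, pool."""
    h, w, c = frame.shape
    patches = frame.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    tokens = patches @ w_embed                        # (n_patches, dim)
    tokens = tokens + self_attention(tokens, *attn)   # residual connection
    return tokens.mean(axis=0)                        # one vector per frame


def stage_two(frame_features, attn, w_out):
    """Temporal stage: attend across frame features, pool to one clip score."""
    tokens = frame_features + self_attention(frame_features, *attn)
    logit = tokens.mean(axis=0) @ w_out
    return 1.0 / (1.0 + np.exp(-logit))               # sigmoid score in (0, 1)


def score_clip(clip, patch=8, dim=32, seed=0):
    """Run both stages on a clip of shape (n_frames, height, width, channels)."""
    rng = np.random.default_rng(seed)
    channels = clip.shape[-1]
    w_embed = rng.normal(scale=0.02, size=(patch * patch * channels, dim))
    attn_spatial = init_attention(rng, dim)
    attn_temporal = init_attention(rng, dim)
    w_out = rng.normal(scale=dim ** -0.5, size=dim)
    features = np.stack(
        [stage_one(frame, patch, w_embed, attn_spatial) for frame in clip]
    )
    return features, float(stage_two(features, attn_temporal, w_out))


# Synthetic stand-in for a 16-frame, 32x32 RGB clip (not real XD-Violence data).
clip = np.random.default_rng(1).random((16, 32, 32, 3))
features, score = score_clip(clip)
print(features.shape, score)
```

Because the weights are random, the printed score carries no meaning. The point of the split is that the spatial stage runs once per frame and attends only over that frame's patches, while the temporal stage attends over one token per frame instead of over every patch of every frame.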