Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection

IF 1.5 | Zone 4 (Computer Science) | JCR Q4 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

Computer Journal | Pub Date: 2023-10-20 | DOI: 10.1093/comjnl/bxad103

Chenghao Li, Xinyan Yang, Gang Liang


Citations: 0


Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection

Abstract: Violence detection is a critical task that aims to identify violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events make it difficult to pinpoint instances of violence accurately, and in particular to extract the frames that indicate violence. Designing and applying high-performance models for violence detection also remains an open problem. Traditional models embed spatial features extracted from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address these challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm selects the frames most likely to represent violent scenes in a video. Next, to transform visual sequences into spatio-temporal feature maps, a multi-path excitation module activates spatio-temporal, channel and motion features. Finally, a Video Swin Transformer-based network performs both global and local spatio-temporal modeling, enabling comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98% and 93%, respectively, surpassing the state of the art.
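The abstract does not say how keyframes are chosen, so the following is only a minimal sketch of the general idea of picking the frames most likely to contain sudden activity. It swaps in a simple frame-differencing heuristic (mean absolute difference from the previous frame), which is an assumption and not the ACTION-VST algorithm; the function name `select_keyframes` and its parameters are hypothetical.

```python
import numpy as np


def select_keyframes(frames: np.ndarray, k: int) -> np.ndarray:
    """Return the sorted indices of the k frames with the largest motion score.

    frames: array of shape (T, H, W) or (T, H, W, C).
    Assumed heuristic (not from the paper): the score of frame t is the mean
    absolute pixel difference between frame t and frame t-1.
    """
    if frames.ndim < 3:
        raise ValueError("frames must have shape (T, H, W) or (T, H, W, C)")
    num_frames = frames.shape[0]
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and the number of frames")

    # Cast to float first so uint8 subtraction cannot wrap around.
    x = frames.astype(np.float32)
    pixel_axes = tuple(range(1, x.ndim))
    diffs = np.abs(x[1:] - x[:-1]).mean(axis=pixel_axes)

    # Frame 0 has no predecessor, so it gets a score of 0.
    scores = np.concatenate(([0.0], diffs))

    # Stable sort on negated scores: highest score first, ties go to earlier frames.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)
```

In a pipeline like the one the abstract outlines, the selected indices would be used to gather a fixed-length clip (for example `frames[select_keyframes(frames, 16)]`) before it is passed to the feature-extraction stages; the clip length of 16 is likewise an illustrative choice.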


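Source journal

Computer Journal (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 3.60

Self-citation rate: 7.10%

Articles published: 164

Review time: 4.8 months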

About the journal: The Computer Journal is one of the longest-established journals serving all branches of the academic computer science community. It is currently published in four sections.