基于多路径激励的关键帧引导视频Swin变压器暴力检测

IF 1.5 4区 计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Chenghao Li, Xinyan Yang, Gang Liang
{"title":"基于多路径激励的关键帧引导视频Swin变压器暴力检测","authors":"Chenghao Li, Xinyan Yang, Gang Liang","doi":"10.1093/comjnl/bxad103","DOIUrl":null,"url":null,"abstract":"Abstract Violence detection is a critical task aimed at identifying violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events present significant hurdles in accurately pinpointing instances of violence, making the extraction of frames that indicate violence a challenging endeavor. Furthermore, designing and applying high-performance models for violence detection remains an open problem. Traditional models embed extracted spatial features from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address the existing challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm is developed to select frames that are most likely to represent violent scenes in videos. To transform visual sequences into spatio-temporal feature maps, a multi-path excitation module is proposed to activate spatio-temporal, channel and motion features. Next, an advanced Video Swin Transformer-based network is employed for both global and local spatio-temporal modeling, which enables comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98 and 93%, respectively, surpassing the state of the art.","PeriodicalId":50641,"journal":{"name":"Computer Journal","volume":null,"pages":null},"PeriodicalIF":1.5000,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection\",\"authors\":\"Chenghao Li, Xinyan Yang, Gang Liang\",\"doi\":\"10.1093/comjnl/bxad103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Violence detection is a critical task aimed at identifying violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events present significant hurdles in accurately pinpointing instances of violence, making the extraction of frames that indicate violence a challenging endeavor. Furthermore, designing and applying high-performance models for violence detection remains an open problem. Traditional models embed extracted spatial features from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address the existing challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm is developed to select frames that are most likely to represent violent scenes in videos. To transform visual sequences into spatio-temporal feature maps, a multi-path excitation module is proposed to activate spatio-temporal, channel and motion features. Next, an advanced Video Swin Transformer-based network is employed for both global and local spatio-temporal modeling, which enables comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98 and 93%, respectively, surpassing the state of the art.\",\"PeriodicalId\":50641,\"journal\":{\"name\":\"Computer Journal\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-10-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/comjnl/bxad103\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/comjnl/bxad103","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

摘要

摘要暴力检测是一项关键任务,旨在通过提取帧并应用分类模型来识别视频中的暴力行为。然而,视频数据的复杂性和暴力事件的突发性给准确定位暴力事件带来了重大障碍,使得提取表明暴力的帧成为一项具有挑战性的工作。此外,设计和应用高性能的暴力检测模型仍然是一个悬而未决的问题。传统模型将从采样帧中提取的空间特征直接嵌入到时间序列中,忽略了视频的时空特征,限制了表达相邻帧之间连续变化的能力。为了解决现有的挑战,本文提出了一个名为ACTION-VST的新框架。首先,开发了一种关键帧提取算法,以选择最可能代表视频中暴力场景的帧。为了将视觉序列转化为时空特征映射,提出了一种多路径激励模块来激活时空、通道和运动特征。其次,采用一种先进的基于视频旋转变压器的网络进行全局和局部时空建模,从而实现对暴力的全面特征提取和表示。该方法在RLVS和RWF-2000两个大型数据集上进行了验证,准确率分别超过98%和93%,超过了目前的水平。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Keyframe-guided Video Swin Transformer with Multi-path Excitation for Violence Detection
Abstract Violence detection is a critical task aimed at identifying violent behavior in video by extracting frames and applying classification models. However, the complexity of video data and the suddenness of violent events present significant hurdles in accurately pinpointing instances of violence, making the extraction of frames that indicate violence a challenging endeavor. Furthermore, designing and applying high-performance models for violence detection remains an open problem. Traditional models embed extracted spatial features from sampled frames directly into a temporal sequence, which ignores the spatio-temporal characteristics of video and limits the ability to express continuous changes between adjacent frames. To address the existing challenges, this paper proposes a novel framework called ACTION-VST. First, a keyframe extraction algorithm is developed to select frames that are most likely to represent violent scenes in videos. To transform visual sequences into spatio-temporal feature maps, a multi-path excitation module is proposed to activate spatio-temporal, channel and motion features. Next, an advanced Video Swin Transformer-based network is employed for both global and local spatio-temporal modeling, which enables comprehensive feature extraction and representation of violence. The proposed method was validated on two large-scale datasets, RLVS and RWF-2000, achieving accuracies of over 98 and 93%, respectively, surpassing the state of the art.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Computer Journal
Computer Journal 工程技术-计算机:软件工程
CiteScore
3.60
自引率
7.10%
发文量
164
审稿时长
4.8 months
期刊介绍: The Computer Journal is one of the longest-established journals serving all branches of the academic computer science community. It is currently published in four sections.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信