Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern

Jingjia Huang, Nannan Li, Jia-Xing Zhong, Thomas H. Li, Ge Li
{"title":"Online Action Tube Detection via Resolving the Spatio-temporal Context Pattern","authors":"Jingjia Huang, Nannan Li, Jia-Xing Zhong, Thomas H. Li, Ge Li","doi":"10.1145/3240508.3240659","DOIUrl":null,"url":null,"abstract":"At present, spatio-temporal action detection in the video is still a challenging problem, considering the complexity of the background, the variety of the action or the change of the viewpoint in the unconstrained environment. Most of current approaches solve the problem via a two-step processing: first detecting actions at each frame; then linking them, which neglects the continuity of the action and operates in an offline and batch processing manner. In this paper, we attempt to build an online action detection model that introduces the spatio-temporal coherence existed among action regions when performing action category inference and position localization. Specifically, we seek to represent the spatio-temporal context pattern via establishing an encoder-decoder model based on the convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves such information at each time instant for action detection via fusing the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm, which accomplishes action bounding-boxes association, action label determination and the temporal trimming in a single pass. Our model takes in the appearance, motion or fused signals as input and is tested on two prevailing datasets, UCF-Sports and UCF-101. The experiment results demonstrate the effectiveness of our method which achieves a performance superior or comparable to compared existing approaches.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3240508.3240659","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Spatio-temporal action detection in videos remains a challenging problem, given the complexity of backgrounds, the variety of actions, and changes of viewpoint in unconstrained environments. Most current approaches solve the problem in two steps: first detecting actions at each frame, then linking the detections. This neglects the continuity of the action and operates in an offline, batch-processing manner. In this paper, we attempt to build an online action detection model that exploits the spatio-temporal coherence among action regions when performing action category inference and position localization. Specifically, we represent the spatio-temporal context pattern with an encoder-decoder model based on a convolutional recurrent network. The model accepts a video snippet as input and encodes the dynamic information of the action in the forward pass. During the backward pass, it resolves this information at each time instant for action detection by fusing it with the current static or motion cue. Additionally, we propose an incremental action tube generation algorithm that accomplishes action bounding-box association, action label determination, and temporal trimming in a single pass. Our model takes appearance, motion, or fused signals as input and is evaluated on two prevailing datasets, UCF-Sports and UCF-101. The experimental results demonstrate the effectiveness of our method, which achieves performance superior or comparable to existing approaches.
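
The paper does not include code; the following is a minimal PyTorch sketch of the kind of convolutional-recurrent encoder-decoder the abstract describes. The framework choice, the ConvLSTM gating, and all names and sizes (`ConvLSTMCell`, `feat_ch`, `n_anchors`, the anchor-based detection heads) are our assumptions, not the authors' implementation. The encoder consumes per-frame CNN features of a snippet in the forward pass; the decoder re-reads the frames in reverse, fusing each frame's static or motion cue with the carried context to emit per-frame class scores and box offsets.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: the four gates are computed with one 2-D
    convolution so the spatial layout of the feature maps is preserved."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ConvRNNEncoderDecoder(nn.Module):
    """Encoder pass accumulates the action's dynamics over the snippet;
    decoder pass runs backward over the same frames, fusing the encoded
    context with each frame's feature to produce per-frame detections.
    Hyperparameters here are illustrative, not the paper's."""
    def __init__(self, feat_ch=256, hid_ch=256, n_classes=24, n_anchors=9):
        super().__init__()
        self.enc = ConvLSTMCell(feat_ch, hid_ch)
        self.dec = ConvLSTMCell(feat_ch, hid_ch)
        self.cls = nn.Conv2d(hid_ch, n_anchors * (n_classes + 1), 1)  # class scores
        self.reg = nn.Conv2d(hid_ch, n_anchors * 4, 1)                # box offsets

    def forward(self, feats):                      # feats: (T, B, C, H, W)
        T, B, _, H, W = feats.shape
        h = feats.new_zeros(B, self.enc.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):                         # forward pass: encode dynamics
            h, c = self.enc(feats[t], (h, c))
        outs = []
        for t in reversed(range(T)):               # backward pass: resolve context
            h, c = self.dec(feats[t], (h, c))      # fuse current static/motion cue
            outs.append((self.cls(h), self.reg(h)))
        return outs[::-1]                          # detections in frame order
```

For example, `ConvRNNEncoderDecoder()(torch.randn(8, 2, 256, 14, 14))` would yield eight (score, offset) pairs, one per frame of the snippet, each informed by the whole snippet's context.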
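
Similarly, the incremental tube generation step can be pictured as a single online pass that extends active tubes with new per-frame boxes, starts tubes from unmatched detections, and trims tubes that go quiet. The greedy IoU matching and every threshold below (`iou_thr`, `score_thr`, `max_miss`) are illustrative assumptions, not the authors' algorithm.

```python
from dataclasses import dataclass

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

@dataclass
class Tube:
    boxes: list            # per-frame boxes accumulated so far
    scores: list           # per-frame class scores, used for the tube's label
    misses: int = 0        # consecutive unmatched/low-score frames

def update_tubes(tubes, detections, iou_thr=0.3, score_thr=0.2, max_miss=5):
    """One online step: greedily extend each active tube with the detection
    that best overlaps its last box, spawn new tubes from leftovers, and
    close (temporally trim) tubes quiet for max_miss frames."""
    unmatched = list(detections)           # each detection: (box, score)
    for tube in tubes:
        best, best_ov = None, iou_thr
        for d in unmatched:
            ov = iou(tube.boxes[-1], d[0])
            if ov > best_ov:
                best, best_ov = d, ov
        if best is not None and best[1] >= score_thr:
            tube.boxes.append(best[0])
            tube.scores.append(best[1])
            tube.misses = 0
            unmatched.remove(best)
        else:
            tube.misses += 1
    alive = [t for t in tubes if t.misses < max_miss]
    finished = [t for t in tubes if t.misses >= max_miss]
    for box, score in unmatched:           # unclaimed confident boxes seed tubes
        if score >= score_thr:
            alive.append(Tube([box], [score]))
    return alive, finished
```

Because association, labeling, and trimming all happen inside this one per-frame update, the procedure matches the abstract's "single pass" description and needs no offline re-linking.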