EScALation: a framework for efficient and scalable spatio-temporal action localization

Bo Chen, K. Nahrstedt
DOI: 10.1145/3458305.3459598
Published in: Proceedings of the 12th ACM Multimedia Systems Conference, 2021-07-15
Citations: 2

Abstract

Spatio-temporal action localization aims to detect the spatial location and the start/end time of an action in a video. The state-of-the-art approach uses convolutional neural networks to extract candidate bounding boxes for the action in each frame and then links the bounding boxes into action tubes based on the location and the class-specific score of each box. Although this approach achieves good localization accuracy, it is computation-intensive: high-end GPUs are usually required to reach real-time performance. In addition, it does not scale well to a large number of action classes. In this work, we present EScALation, a framework for making spatio-temporal action localization efficient and scalable. Our framework involves two main strategies. The first is a frame sampling technique that exploits the temporal correlation between frames and selects key frame(s) from each temporally correlated set of frames for bounding box detection. The second is a class filtering technique that uses bounding box information to predict the action class before linking bounding boxes. We compare EScALation with the state-of-the-art approach on the UCF101-24 and J-HMDB-21 datasets. One of our experiments shows that EScALation saves 72.2% of the time with only a 6.1% loss of mAP. In addition, we show that EScALation scales better to a large number of action classes than the state-of-the-art approach.
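The two strategies described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pixel-difference grouping criterion, the thresholds, and the score-summing aggregation are all illustrative assumptions, and the per-frame detector itself is left out (the sketch only shows where detection would be restricted to key frames and how classes could be pruned before tube linking).

```python
import numpy as np


def sample_key_frames(frames, diff_threshold=10.0):
    """Group consecutive frames whose mean absolute pixel difference from
    the group's anchor frame stays below diff_threshold, and keep one key
    frame per group. Bounding box detection would then run only on the
    returned indices. (Hypothetical criterion, not the paper's exact one.)"""
    key_indices = [0]
    anchor = frames[0].astype(float)
    for i in range(1, len(frames)):
        if np.abs(frames[i].astype(float) - anchor).mean() > diff_threshold:
            key_indices.append(i)       # start a new temporally correlated set
            anchor = frames[i].astype(float)
    return key_indices


def filter_classes(frame_scores, top_k=1):
    """Aggregate per-frame class scores from detected bounding boxes and keep
    only the top_k classes, so that tube linking is performed for those
    classes rather than all of them. (Illustrative aggregation by summing.)"""
    totals = np.sum(frame_scores, axis=0)           # shape: (num_classes,)
    ranked = np.argsort(totals)[::-1][:top_k]       # best classes first
    return [int(c) for c in ranked]
```

For example, a static shot followed by a scene change would yield one key frame per segment, and summed box scores across those key frames would select the class(es) to link into action tubes.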