{"title":"EScALation: a framework for efficient and scalable spatio-temporal action localization","authors":"Bo Chen, K. Nahrstedt","doi":"10.1145/3458305.3459598","DOIUrl":null,"url":null,"abstract":"Spatio-temporal action localization aims to detect the spatial location and the start/end time of the action in a video. The state-of-the-art approach uses convolutional neural networks to extract possible bounding boxes for the action in each frame and then link bounding boxes into action tubes based on the location and the class-specific score of each bounding box. Though this approach has been successful at achieving a good localization accuracy, it is computation-intensive. High-end GPUs are usually demanded for it to achieve real-time performance. In addition, this approach does not scale well on a large number of action classes. In this work, we present a framework, EScALation, for making spatio-temporal action localization efficient and scalable. Our framework involves two main strategies. One is the frame sampling technique that utilizes the temporal correlation between frames and selects key frame(s) from a temporally correlated set of frames to perform bounding box detection. The other is the class filtering technique that exploits bounding box information to predict the action class prior to linking bounding boxes. We compare EScALation with the state-of-the-art approach on UCF101-24 and J-HMDB-21 datasets. One of our experiments shows EScALation is able to save 72.2% of the time with only 6.1% loss of mAP. In addition, we show that EScALation scales better to a large number of action classes than the state-of-the-art approach.","PeriodicalId":138399,"journal":{"name":"Proceedings of the 12th ACM Multimedia Systems Conference","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th ACM Multimedia Systems Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3458305.3459598","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Spatio-temporal action localization aims to detect the spatial location and the start/end time of an action in a video. The state-of-the-art approach uses convolutional neural networks to extract candidate bounding boxes for the action in each frame and then links the bounding boxes into action tubes based on the location and the class-specific score of each box. Although this approach achieves good localization accuracy, it is computation-intensive and usually requires high-end GPUs to reach real-time performance. In addition, it does not scale well to a large number of action classes. In this work, we present EScALation, a framework for making spatio-temporal action localization efficient and scalable. Our framework involves two main strategies. The first is a frame sampling technique that exploits the temporal correlation between frames, selecting key frame(s) from each temporally correlated set of frames and performing bounding box detection only on those. The second is a class filtering technique that uses bounding box information to predict the action class before linking bounding boxes. We compare EScALation with the state-of-the-art approach on the UCF101-24 and J-HMDB-21 datasets. One of our experiments shows that EScALation saves 72.2% of the time with only a 6.1% loss of mAP. In addition, we show that EScALation scales better to a large number of action classes than the state-of-the-art approach.
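
As a rough illustration of how the two strategies fit together, below is a minimal, self-contained Python sketch. The detector stub, the score aggregation, and the greedy per-class linking are hypothetical stand-ins chosen for clarity, not the paper's actual implementation or API.

```python
# Hypothetical sketch of frame sampling + class filtering before tube linking.
# The detector, scoring, and linking logic here are illustrative stand-ins.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]       # (x1, y1, x2, y2)
Detection = Tuple[Box, Dict[str, float]]      # box plus per-class scores

def detect_boxes(frame) -> List[Detection]:
    """Stand-in for the CNN bounding-box detector (the expensive step)."""
    return [((0.1, 0.1, 0.5, 0.5), {"run": 0.8, "jump": 0.1, "sit": 0.1})]

def localize(frames: List, group_size: int = 8, top_k: int = 1):
    # Frame sampling: run the detector only on one key frame per
    # temporally correlated group, and reuse its boxes for the rest.
    per_frame: List[List[Detection]] = []
    for start in range(0, len(frames), group_size):
        key_dets = detect_boxes(frames[start])
        per_frame.extend([key_dets] * min(group_size, len(frames) - start))

    # Class filtering: aggregate per-box class scores over the video and
    # keep only the top-k classes *before* the expensive linking step.
    totals: Dict[str, float] = {}
    for dets in per_frame:
        for _box, scores in dets:
            for cls, s in scores.items():
                totals[cls] = totals.get(cls, 0.0) + s
    keep = sorted(totals, key=totals.get, reverse=True)[:top_k]

    # Linking: build one tube per surviving class by greedily taking,
    # in each frame, the box that scores highest for that class.
    tubes: Dict[str, List[Detection]] = {}
    for cls in keep:
        tubes[cls] = [max(dets, key=lambda d: d[1].get(cls, 0.0))
                      for dets in per_frame if dets]
    return tubes

if __name__ == "__main__":
    video = list(range(24))            # 24 dummy frames
    print(localize(video).keys())      # -> dict_keys(['run'])
```

In this sketch the savings come from two places: detection runs on one frame per group instead of every frame, and tube linking runs only for the few classes that survive filtering rather than for all action classes.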