Skimming and Scanning for Efficient Action Recognition in Untrimmed Videos

Yunyan Hong, Ailing Zeng, Min Li, Cewu Lu, Li Jiang, Qiang Xu
{"title":"Skimming and Scanning for Efficient Action Recognition in Untrimmed Videos","authors":"Yunyan Hong, Ailing Zeng, Min Li, Cewu Lu, Li Jiang, Qiang Xu","doi":"10.1109/CISP-BMEI53629.2021.9624415","DOIUrl":null,"url":null,"abstract":"Video action recognition (VAR) aims to classify videos into a predefined set of classes, which is a primary task of video understanding. We mainly focus on the VAR of untrimmed videos because they are most common videos in real-life scenes. Untrimmed videos have redundant and diverse clips containing contextual information, so sampling the clips is essential. Recently, some works attempt to train a generic model to select the $N$ most representative clips. However, it is difficult to model the complex relations from intra-class clips and inter-class videos within a single model and fixed selected number, and the entanglement of multiple relations is also hard to explain. Thus, instead of “only look once”, we argue “divide and conquer” strategy will be more suitable in untrimmed VAR. Inspired by the speed reading mechanism, we propose a simple yet effective clip-level solution based on skim-scan techniques. Specifically, the proposed Skim-Scan framework first skims the entire video and drops those uninformative and misleading clips. For the remaining clips, it scans clips with diverse features gradually to drop redundant clips but cover essential content. The above strategies can adaptively select the necessary clips according to the difficulty of the different videos. In order to further cut computational overhead, we observe the similar statistical expression between lightweight and heavy networks. Thus, we explore the combination of them to trade off the computational complexity and performance. Comprehensive experiments are performed on ActivityNet and mini-FCVID datasets, and results demonstrate that our solution surpasses the state-of-the-art performance in terms of accuracy and efficiency.","PeriodicalId":131256,"journal":{"name":"2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISP-BMEI53629.2021.9624415","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Video action recognition (VAR) aims to classify videos into a predefined set of classes and is a primary task of video understanding. We focus on VAR for untrimmed videos because they are the most common videos in real-life scenes. Untrimmed videos contain redundant and diverse clips with contextual information, so sampling the clips is essential. Recently, some works have attempted to train a generic model to select the $N$ most representative clips. However, it is difficult to model the complex relations among intra-class clips and inter-class videos within a single model and a fixed number of selected clips, and the entanglement of multiple relations is also hard to interpret. Thus, instead of "only look once", we argue that a "divide and conquer" strategy is more suitable for untrimmed VAR. Inspired by the speed-reading mechanism, we propose a simple yet effective clip-level solution based on skim-scan techniques. Specifically, the proposed Skim-Scan framework first skims the entire video and drops uninformative and misleading clips. For the remaining clips, it gradually scans clips with diverse features, dropping redundant clips while covering the essential content. These strategies adaptively select the necessary clips according to the difficulty of each video. To further cut computational overhead, we observe that lightweight and heavy networks produce similar statistical expressions, and we explore combining them to trade off computational complexity and performance. Comprehensive experiments on the ActivityNet and mini-FCVID datasets demonstrate that our solution surpasses state-of-the-art methods in terms of accuracy and efficiency.
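The abstract describes the two-stage skim-scan clip selection only at a high level. Below is a minimal illustrative Python sketch of such a selection, not the authors' implementation: the confidence threshold used in the skim stage, the greedy cosine-similarity criterion in the scan stage, and all names (`skim_scan_select`, `features`, `confidences`, `conf_thresh`, `sim_thresh`) are assumptions made for illustration.

```python
import numpy as np

def skim_scan_select(features, confidences, conf_thresh=0.5, sim_thresh=0.9):
    """Illustrative two-stage clip selection (hypothetical, not the paper's exact algorithm).

    features:    (T, D) array of per-clip feature vectors (e.g. from a lightweight network).
    confidences: (T,)   array of per-clip classification confidences.
    Returns the indices of the clips kept for the heavy recognition network.
    """
    # Skim stage: drop low-confidence clips, treated here as uninformative or misleading.
    candidates = np.flatnonzero(confidences >= conf_thresh)

    # Scan stage: greedily keep clips whose features differ from already-kept clips,
    # dropping redundant ones so the selection still covers diverse content.
    selected = []
    for idx in candidates:
        f = features[idx] / (np.linalg.norm(features[idx]) + 1e-8)
        redundant = any(
            float(f @ (features[j] / (np.linalg.norm(features[j]) + 1e-8))) >= sim_thresh
            for j in selected
        )
        if not redundant:
            selected.append(int(idx))
    return selected  # the number of kept clips adapts to each video


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 128))   # 16 clips, 128-dim features
    confs = rng.uniform(size=16)         # per-clip confidences
    print(skim_scan_select(feats, confs))
```

Because both stages only filter indices, the number of clips passed on varies per video, which matches the abstract's claim of adaptively selecting clips according to video difficulty.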