Xu Chen; Yahong Han; Changlin Li; Xiaojun Chang; Yifan Sun; Yi Yang

IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 8, pp. 14664–14677. Published 2025-01-10. DOI: 10.1109/TNNLS.2024.3525187
A Static-Dynamic Composition Framework for Efficient Action Recognition
Dynamic inference, which adaptively allocates computational budgets across samples, is a prevalent approach for efficient action recognition. Current studies primarily focus on a data-efficient regime that reduces spatial redundancy, temporal redundancy, or both by selecting partial video data, such as clips, frames, or patches. However, these approaches often rely on fixed, computationally expensive networks. From a different perspective, this article introduces a novel model-efficient regime that addresses network redundancy by dynamically selecting a partial network in real time. Specifically, we observe that different channels of a neural network inherently carry semantics that are redundant either spatially or temporally; decreasing the network width therefore improves efficiency at the cost of feature capacity. To strike a balance between efficiency and capacity, we propose the static-dynamic composition (SDCOM) framework, which comprises a static network with a fixed width and a dynamic network with a flexible width. In this framework, the static network extracts a primary feature with essential semantics from the input frame and simultaneously estimates the gap that remains toward a comprehensive feature representation. Based on this estimate, the dynamic network activates the minimal width needed to extract a supplementary feature that fills the identified gap. We optimize the dynamic feature extraction through a slimmable-network mechanism and a novel meta-learning scheme introduced in this article. Empirical analysis reveals that combining the primary feature with an extremely lightweight supplementary feature suffices to accurately recognize the large majority of frames (76%–92%). As a result, the proposed SDCOM significantly improves recognition efficiency: on the ActivityNet, FCVID, and Mini-Kinetics datasets, it saves 90% of the baseline's floating-point operations (FLOPs) while achieving comparable or superior accuracy relative to state-of-the-art methods.
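To make the static-dynamic composition concrete, below is a minimal, self-contained PyTorch sketch of the idea at inference time. It is an illustration under assumed dimensions, not the authors' implementation: the name SDCOMBlock, the width set (0, 0.25, 0.5, 1.0), the linear branches, and the argmax policy head are all hypothetical stand-ins. In the paper, the width decision is trained via a slimmable-network mechanism and a meta-learning scheme, whereas the policy head here is left untrained purely for shape-level illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDCOMBlock(nn.Module):
    """Sketch of static-dynamic composition (hypothetical, not the
    authors' code): a fixed-width static branch always runs; a
    slimmable dynamic branch runs at a per-frame width chosen by a
    lightweight policy head reading the primary feature."""

    def __init__(self, in_dim=256, static_dim=128, dynamic_dim=128,
                 widths=(0.0, 0.25, 0.5, 1.0)):
        super().__init__()
        self.widths = widths
        # Static branch: fixed width, extracts the primary feature.
        self.static = nn.Linear(in_dim, static_dim)
        # Dynamic branch: built at full width; narrower settings reuse
        # only the leading output channels (slimmable-network style).
        self.dynamic = nn.Linear(in_dim, dynamic_dim)
        # Policy head: estimates from the primary feature how much
        # supplementary capacity the frame still needs.
        self.policy = nn.Linear(static_dim, len(widths))

    def forward(self, x):                      # x: (B, in_dim) frame features
        primary = F.relu(self.static(x))       # always computed, (B, static_dim)
        choice = self.policy(primary).argmax(dim=-1)  # per-frame width index
        outputs = []
        for i in range(x.size(0)):
            frac = self.widths[int(choice[i])]
            k = int(frac * self.dynamic.out_features)
            if k == 0:
                # Primary feature judged sufficient: skip the dynamic branch.
                supp = x.new_zeros(self.dynamic.out_features)
            else:
                # Slimmable slice: evaluate only the first k output channels.
                supp = F.relu(F.linear(x[i], self.dynamic.weight[:k],
                                       self.dynamic.bias[:k]))
                supp = F.pad(supp, (0, self.dynamic.out_features - k))
            outputs.append(torch.cat([primary[i], supp]))
        return torch.stack(outputs)            # (B, static_dim + dynamic_dim)


# Usage sketch: 8 frame-level features in, composed features out.
block = SDCOMBlock()
frames = torch.randn(8, 256)
features = block(frames)
print(features.shape)                          # torch.Size([8, 256])
```

The design choice the sketch highlights is that the static branch's cost is constant while the dynamic branch's cost scales with the selected width, so frames whose primary feature already closes the gap incur almost no supplementary computation.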
Journal Introduction:
IEEE Transactions on Neural Networks and Learning Systems presents scholarly articles on the theory, design, and applications of neural networks and other learning systems, with an emphasis on technical and scientific research in this domain.