Xu Chen; Yahong Han; Changlin Li; Xiaojun Chang; Yifan Sun; Yi Yang

IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 8, pp. 14664–14677. Published 2025-01-10. DOI: 10.1109/TNNLS.2024.3525187
A Static-Dynamic Composition Framework for Efficient Action Recognition
Dynamic inference, which adaptively allocates computational budgets across samples, is a prevalent approach for efficient action recognition. Current studies primarily focus on a data-efficient regime that reduces spatial redundancy, temporal redundancy, or both by selecting partial video data, such as clips, frames, or patches. However, these approaches often rely on fixed, computationally expensive networks. From a different perspective, this article introduces a novel model-efficient regime that addresses network redundancy by dynamically selecting a partial network in real time. Specifically, we observe that different channels of a neural network inherently carry semantics that are redundant either spatially or temporally; decreasing the network width therefore improves efficiency at the cost of feature capacity. To strike a balance between efficiency and capacity, we propose the static-dynamic composition (SDCOM) framework, which comprises a static network with a fixed width and a dynamic network with a flexible width. In this framework, the static network extracts a primary feature with essential semantics from the input frame and simultaneously estimates the gap that remains toward a comprehensive feature representation. Based on this estimate, the dynamic network activates the minimal width needed to extract a supplementary feature that fills the identified gap. We optimize the dynamic feature extraction through a slimmable-network mechanism and a novel meta-learning scheme introduced in this article. Empirical analysis reveals that combining the primary feature with an extremely lightweight supplementary feature suffices to accurately recognize the large majority of frames (76%–92%). As a result, the proposed SDCOM significantly improves recognition efficiency: on the ActivityNet, FCVID, and Mini-Kinetics datasets, it saves 90% of the baseline's floating-point operations (FLOPs) while achieving comparable or superior accuracy relative to state-of-the-art methods.
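To make the static-dynamic composition concrete, below is a minimal, self-contained PyTorch sketch of the idea at inference time. It is an illustration under assumed dimensions, not the authors' implementation: the name SDCOMBlock, the width set (0, 0.25, 0.5, 1.0), the linear branches, and the argmax policy head are all hypothetical stand-ins. In the paper, the width decision is trained via a slimmable-network mechanism and a meta-learning scheme, whereas the policy head here is left untrained purely for shape-level illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDCOMBlock(nn.Module):
    """Sketch of static-dynamic composition (hypothetical, not the
    authors' code): a fixed-width static branch always runs; a
    slimmable dynamic branch runs at a per-frame width chosen by a
    lightweight policy head reading the primary feature."""

    def __init__(self, in_dim=256, static_dim=128, dynamic_dim=128,
                 widths=(0.0, 0.25, 0.5, 1.0)):
        super().__init__()
        self.widths = widths
        # Static branch: fixed width, extracts the primary feature.
        self.static = nn.Linear(in_dim, static_dim)
        # Dynamic branch: built at full width; narrower settings reuse
        # only the leading output channels (slimmable-network style).
        self.dynamic = nn.Linear(in_dim, dynamic_dim)
        # Policy head: estimates from the primary feature how much
        # supplementary capacity the frame still needs.
        self.policy = nn.Linear(static_dim, len(widths))

    def forward(self, x):                      # x: (B, in_dim) frame features
        primary = F.relu(self.static(x))       # always computed, (B, static_dim)
        choice = self.policy(primary).argmax(dim=-1)  # per-frame width index
        outputs = []
        for i in range(x.size(0)):
            frac = self.widths[int(choice[i])]
            k = int(frac * self.dynamic.out_features)
            if k == 0:
                # Primary feature judged sufficient: skip the dynamic branch.
                supp = x.new_zeros(self.dynamic.out_features)
            else:
                # Slimmable slice: evaluate only the first k output channels.
                supp = F.relu(F.linear(x[i], self.dynamic.weight[:k],
                                       self.dynamic.bias[:k]))
                supp = F.pad(supp, (0, self.dynamic.out_features - k))
            outputs.append(torch.cat([primary[i], supp]))
        return torch.stack(outputs)            # (B, static_dim + dynamic_dim)


# Usage sketch: 8 frame-level features in, composed features out.
block = SDCOMBlock()
frames = torch.randn(8, 256)
features = block(frames)
print(features.shape)                          # torch.Size([8, 256])
```

The design choice the sketch highlights is that the static branch's cost is constant while the dynamic branch's cost scales with the selected width, so frames whose primary feature already closes the gap incur almost no supplementary computation.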
Journal Introduction:
IEEE Transactions on Neural Networks and Learning Systems presents scholarly articles on the theory, design, and applications of neural networks and other learning systems, with an emphasis on technical and scientific research in this domain.