Hybrid attentive prototypical network for few-shot action recognition

IF 5, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Complex & Intelligent Systems, Pub Date: 2024-08-19, DOI: 10.1007/s40747-024-01571-4

Zanxi Ruan, Yingmei Wei, Yanming Guo, Yuxiang Xie


Citations: 0

Abstract

Most previous few-shot action recognition methods process the temporal and spatial features of video separately, which limits how comprehensively features are extracted. This paper proposes a hybrid attentive prototypical network (HAPN) framework for few-shot action recognition. HAPN processes temporal and spatial information jointly, from feature extraction through to the attention module, which improves its ability to recognize actions. The framework uses the R(2+1)D backbone network to extract integrated temporal and spatial features, giving a more complete representation of video content. It also introduces the Residual Tri-dimensional Attention (ResTriDA) mechanism, designed to augment feature information across the temporal, spatial, and channel dimensions. ResTriDA dynamically amplifies the channel-wise features that distinguish actions, accentuates the spatial details that capture the essence of actions within frames, and emphasizes the temporal dynamics that capture movement over time. We further propose a prototypical attentive matching module (PAM), built on metric learning, to address the overfitting common in few-shot tasks. We evaluate HAPN on three classical few-shot action recognition datasets: Kinetics-100, UCF101, and HMDB51. The results indicate that the framework significantly outperforms state-of-the-art methods. Notably, on the 1-shot task, accuracy increased by 9.8% on UCF101, 3.9% on HMDB51, and 12.4% on Kinetics-100. These gains support the robustness and effectiveness of the approach in leveraging limited data for precise action recognition.
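The abstract does not spell out how PAM matches queries to classes, so the following is only a minimal sketch of the generic prototypical-network idea from metric learning that the module is said to build on, not the authors' implementation. It assumes each video clip has already been embedded into a feature vector (for example by a backbone such as R(2+1)D); the helper names, the toy 3-dimensional embeddings, and the choice of squared Euclidean distance are all illustrative assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings.

    In the 1-shot setting the prototype is simply the single support embedding.
    """
    return {label: mean_vector(embs) for label, embs in support.items()}


def classify(query, prototypes):
    """Return (predicted_label, class_probabilities) for one query embedding."""
    # Logits are negative squared distances: nearer prototypes score higher.
    logits = {
        label: -squared_euclidean(query, proto)
        for label, proto in prototypes.items()
    }
    # Softmax over the logits, shifted by the max for numerical stability.
    max_logit = max(logits.values())
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


# Toy 2-way 2-shot episode with hypothetical 3-dimensional embeddings.
support = {
    "run": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "jump": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
prototypes = build_prototypes(support)
label, probs = classify([0.9, 0.1, 0.0], prototypes)
print(label, probs)
```

Because classification depends only on distances to a handful of class means rather than on a learned per-class classifier, this family of methods has few task-specific parameters to overfit, which is the usual motivation for metric learning in few-shot settings.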




Source journal

Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 9.60

Self-citation rate: 10.30%

Articles published: 297

About the journal: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.