Joint image-instance spatial–temporal attention for few-shot action recognition

IF 4.3 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding · Pub Date: 2025-03-01 · DOI: 10.1016/j.cviu.2025.104322

Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying

Volume 254, Article 104322 · Journal Article · https://www.sciencedirect.com/science/article/pii/S1077314225000451

Citations: 0

Abstract


Joint image-instance spatial–temporal attention for few-shot action recognition

Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on the real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial–temporal attention approach (I$^{2}$ST) for Few-shot Action Recognition. The core concept of I$^{2}$ST is to perceive the action-related instances and integrate them with image features via spatial–temporal attention. Specifically, I$^{2}$ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial–temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial–temporal Attention is used to construct the feature dependency between instances and images. To enhance the prototype representations of different categories of videos, a pair of spatial–temporal attention sub-modules is introduced to combine image features and instance embeddings across both temporal and spatial dimensions, and a global fusion sub-module is utilized to aggregate global contextual information, so that robust action video prototypes can be formed. Finally, based on the video prototype, a Global–Local Prototype Matching is performed for reliable few-shot video matching. In this manner, our proposed I$^{2}$ST can effectively exploit the foreground instance-level cues and model more accurate spatial–temporal relationships for the complex few-shot video recognition scenarios. Extensive experiments across standard few-shot benchmarks demonstrate that the proposed framework outperforms existing methods and achieves state-of-the-art performance under various few-shot settings.
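The general idea in the abstract (image features attend to instance embeddings, the fused features are pooled into a video prototype, and a query is matched to the nearest class prototype) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the single-head attention without learned projections, the mean pooling, the cosine similarity, and all function names (`cross_attend`, `video_prototype`, `classify`) are assumptions made only for illustration. The segmentation model, the global fusion sub-module, and the Global–Local matching are not modelled here.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(image_tokens, instance_tokens):
    """Image tokens (queries) attend to instance tokens (keys and values).

    image_tokens: (N, D), instance_tokens: (M, D). Returns (N, D).
    Assumption: no learned projections, one head, residual connection.
    """
    d = image_tokens.shape[-1]
    scores = image_tokens @ instance_tokens.T / np.sqrt(d)  # (N, M)
    weights = softmax(scores, axis=-1)                      # each row sums to 1
    return image_tokens + weights @ instance_tokens


def video_prototype(image_feats, instance_feats):
    """Fuse per-frame image features with instance embeddings, then pool.

    image_feats: (T, D), one feature per frame.
    instance_feats: (T, K, D), K hypothetical instance embeddings per frame.
    Every frame attends to the instances of all frames, which is one crude
    way to mix spatial and temporal context. Returns a (D,) prototype.
    """
    t, k, d = instance_feats.shape
    fused = cross_attend(image_feats, instance_feats.reshape(t * k, d))
    return fused.mean(axis=0)


def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def classify(query_proto, class_protos):
    """Return the best-matching class name and all similarity scores."""
    sims = {name: cosine(query_proto, p) for name, p in class_protos.items()}
    return max(sims, key=sims.get), sims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 2-way 1-shot episode: 4 frames, 3 instances per frame, D = 8.
    support = {
        name: video_prototype(rng.normal(size=(4, 8)), rng.normal(size=(4, 3, 8)))
        for name in ("class_a", "class_b")
    }
    query = video_prototype(rng.normal(size=(4, 8)), rng.normal(size=(4, 3, 8)))
    print(classify(query, support))
```

With more than one support video per class, a common choice in prototype-based few-shot learning is to average the per-video prototypes of each class before matching; whether I$^{2}$ST does this is not stated in the abstract.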


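Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days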

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems