MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition

IF 9.7; Zone 1 (Computer Science); Q1 (Computer Science, Information Systems)

IEEE Transactions on Multimedia, Pub Date: 2025-07-04, DOI: 10.1109/TMM.2025.3586118

Hongyu Qu; Rui Yan; Xiangbo Shu; Hailiang Gao; Peng Huang; Guosen Xie

Volume 27, pages 6593-6605. Publication type: Journal Article. Open access: no. Source listing: https://ieeexplore.ieee.org/document/11071918/

Citations: 0

Abstract

Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level or segment-level) feature alignment, which overlooks the fact that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multiple velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos at different velocity scales and then merge all similarity scores in a residual fashion. To prevent the multi-velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video features via feature interaction in the channel and temporal domains at different velocities. The two modules complement each other to make more accurate query sample predictions under few-shot settings. Experimental results show that our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, SSv2-full, and SSv2-small).
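The multi-velocity matching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' MVFA module: the abstract does not specify how velocity scales are built or how scores are merged, so everything here is an assumption made for illustration. Frame subsampling with a stride stands in for a "velocity scale", frame-wise cosine similarity stands in for alignment, and the "residual" merge is assumed to be the base-scale score plus a weighted sum of the other scales. The names `subsample`, `aligned_similarity`, `multi_velocity_score`, `classify`, and the `alpha` weight are all hypothetical.

```python
import math

def subsample(frames, stride):
    """Keep every `stride`-th frame feature (assumed stand-in for one velocity scale)."""
    return frames[::stride]

def cosine(a, b):
    """Cosine similarity between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def aligned_similarity(support, query):
    """Average cosine similarity over frames paired by temporal position."""
    pairs = list(zip(support, query))
    return sum(cosine(s, q) for s, q in pairs) / len(pairs)

def multi_velocity_score(support, query, strides=(1, 2, 4), alpha=0.5):
    """Score at the first stride plus alpha-weighted scores from the other strides.

    The additive form is an assumed reading of "merge in a residual fashion".
    """
    sims = [aligned_similarity(subsample(support, s), subsample(query, s))
            for s in strides]
    score = sims[0] + alpha * sum(sims[1:])
    return score, sims

def classify(query, support_set, **kwargs):
    """Return the label of the support video with the highest merged score."""
    return max(support_set,
               key=lambda label: multi_velocity_score(support_set[label], query, **kwargs)[0])

# Toy 1-shot episode: each video is a list of 2-D frame features (made-up data).
support_set = {
    "wave": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    "push": [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
predicted = classify(query, support_set)
```

In this toy episode the query matches the "wave" support video at every stride, so `predicted` is `"wave"`. A real system would use learned video features and a trained alignment metric rather than raw cosine similarity over hand-written vectors.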


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.