{"title":"调制运动感知的视觉语言表示在少镜头动作识别中的应用","authors":"Pengfei Fang;Qiang Xu;Zixuan Lin;Hui Xue","doi":"10.1109/TCSVT.2025.3557009","DOIUrl":null,"url":null,"abstract":"This paper focuses on few-shot action recognition (FSAR), where the machine is required to understand human actions, with each only seeing a few video samples. Even with only a few explorations, the most cutting-edge methods employ the action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, <underline>mo</u>tion-aware <underline>v</u>isual-language r<underline>e</u>presentation modulation <underline>net</u>work (MoveNet). The proposed MoveNet utilizes dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, as a way to modulate the video prototypes. In doing so, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. Having the motion cues at hand, a motion-conditional prompting module (MCPM) utilizes the motion cues as conditions to boost the semantic associations between textual features and action classes. One further develops a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance in enhancing local frame features. The proposed components compensate for each other and contribute to significant performance gains over the FASR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, considerably outperforming current state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8614-8626"},"PeriodicalIF":11.1000,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"On Modulating Motion-Aware Visual-Language Representation for Few-Shot Action Recognition\",\"authors\":\"Pengfei Fang;Qiang Xu;Zixuan Lin;Hui Xue\",\"doi\":\"10.1109/TCSVT.2025.3557009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper focuses on few-shot action recognition (FSAR), where the machine is required to understand human actions, with each only seeing a few video samples. Even with only a few explorations, the most cutting-edge methods employ the action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, <underline>mo</u>tion-aware <underline>v</u>isual-language r<underline>e</u>presentation modulation <underline>net</u>work (MoveNet). The proposed MoveNet utilizes dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, as a way to modulate the video prototypes. In doing so, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. Having the motion cues at hand, a motion-conditional prompting module (MCPM) utilizes the motion cues as conditions to boost the semantic associations between textual features and action classes. 
One further develops a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance in enhancing local frame features. The proposed components compensate for each other and contribute to significant performance gains over the FASR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, considerably outperforming current state-of-the-art methods.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"8614-8626\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-04-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10947524/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10947524/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract:
This paper focuses on few-shot action recognition (FSAR), in which a model must recognize human actions from only a few video samples per class. Although this direction has been explored only briefly, the most recent methods employ action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook the rich motion cues within videos. To tackle this issue, we propose a novel framework, the motion-aware visual-language representation modulation network (MoveNet). MoveNet exploits dynamic motion cues within videos to build motion-aware textual and visual feature representations, which in turn modulate the video prototypes. To this end, a long-short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. With these motion cues at hand, a motion-conditional prompting module (MCPM) uses them as conditions to strengthen the semantic associations between textual features and action classes. We further develop a motion-guided visual refinement module (MVRM) that adopts the motion cues as guidance for enhancing local frame features. The proposed components complement each other and together yield significant performance gains on the FSAR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, which considerably outperforms current state-of-the-art methods.
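The abstract describes the pipeline only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the general idea: motion cues aggregated from frame features condition the class text embeddings, and the resulting motion-aware text features modulate the visual prototypes used for few-shot matching. The module names (MotionAggregation, MotionConditionalPrompt, modulate_prototypes), tensor shapes, and the additive fusion are illustrative assumptions rather than the authors' implementation, and the motion-guided visual refinement module (MVRM) is omitted for brevity.

```python
# Hedged sketch (not the paper's code): motion cues condition text prompts,
# and the motion-aware text features modulate visual prototypes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionAggregation(nn.Module):
    """Assumed stand-in for LSMAM: mixes short-range (adjacent-frame) and
    long-range (clip-level) temporal differences into one motion vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, frames):                              # frames: (T, D)
        short_range = frames[1:] - frames[:-1]              # adjacent-frame differences
        long_range = frames - frames.mean(0, keepdim=True)  # deviation from the clip mean
        motion = torch.cat([short_range.mean(0), long_range.mean(0)])  # (2D,)
        return self.proj(motion)                             # (D,)


class MotionConditionalPrompt(nn.Module):
    """Assumed stand-in for MCPM: adds a motion-conditioned offset to the
    static per-class text embeddings (e.g. from a frozen VLM text encoder)."""
    def __init__(self, dim):
        super().__init__()
        self.to_offset = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, text_emb, motion):                     # (C, D), (D,)
        return text_emb + self.to_offset(motion)             # motion-aware text features


def modulate_prototypes(video_protos, text_feats, alpha=0.5):
    """Blend visual prototypes with motion-aware text features (assumed fusion)."""
    return alpha * video_protos + (1 - alpha) * text_feats


if __name__ == "__main__":
    T, D, C = 8, 512, 5                                      # frames, feature dim, classes
    frames = torch.randn(T, D)                               # frame features of one support video
    static_text = torch.randn(C, D)                          # static per-class text embeddings
    query = torch.randn(D)                                   # pooled query-video feature

    motion = MotionAggregation(D)(frames)
    motion_text = MotionConditionalPrompt(D)(static_text, motion)
    protos = modulate_prototypes(frames.mean(0).expand(C, -1), motion_text)
    logits = F.cosine_similarity(query.unsqueeze(0), protos, dim=-1)
    print(logits.shape)                                      # torch.Size([5])
```

In an actual episodic FSAR setup, the prototypes would be computed from the support videos of each class and a frozen VLM would supply the static text embeddings; random tensors stand in for both here.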
Journal introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.