{"title":"基于多镜头动作识别的高效时空建模和文本增强原型","authors":"Qian Zhang, Shuo Yan, Mingwen Shao, Hong Liang","doi":"10.1016/j.neucom.2025.130119","DOIUrl":null,"url":null,"abstract":"<div><div>Few-Shot Action Recognition (FSAR) aims to classify new action categories accurately with only a limited number of labeled samples. Current methods face challenges in capturing spatiotemporal dynamics and integrating multimodal information effectively. This work presents a new framework that improves FSAR performance by improving spatiotemporal modeling and integrating cross-modal semantics. To capture complex spatiotemporal relationships in videos, we introduce two complementary modules: Temporal Enhancement Adaptation (TEA), which enhances temporal modeling capability, and Spatio-Temporal Fusion Adaptation (STFA), which integrates spatial and temporal features for better representations. Additionally, we propose the Text-Enhanced Prototype Module (TEPM), which strengthens prototype representations by fusing textual and visual features at multiple levels, improving the discriminability and generalization of prototypes. The experiments show that our approach achieves competitive performance on various benchmark datasets, confirming its effectiveness in FSAR.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"638 ","pages":"Article 130119"},"PeriodicalIF":5.5000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient spatio-temporal modeling and text-enhanced prototype for few-shot action recognition\",\"authors\":\"Qian Zhang, Shuo Yan, Mingwen Shao, Hong Liang\",\"doi\":\"10.1016/j.neucom.2025.130119\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Few-Shot Action Recognition (FSAR) aims to classify new action categories accurately with only a limited number of labeled samples. Current methods face challenges in capturing spatiotemporal dynamics and integrating multimodal information effectively. This work presents a new framework that improves FSAR performance by improving spatiotemporal modeling and integrating cross-modal semantics. To capture complex spatiotemporal relationships in videos, we introduce two complementary modules: Temporal Enhancement Adaptation (TEA), which enhances temporal modeling capability, and Spatio-Temporal Fusion Adaptation (STFA), which integrates spatial and temporal features for better representations. Additionally, we propose the Text-Enhanced Prototype Module (TEPM), which strengthens prototype representations by fusing textual and visual features at multiple levels, improving the discriminability and generalization of prototypes. 
The experiments show that our approach achieves competitive performance on various benchmark datasets, confirming its effectiveness in FSAR.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"638 \",\"pages\":\"Article 130119\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S092523122500791X\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S092523122500791X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Efficient spatio-temporal modeling and text-enhanced prototype for few-shot action recognition
Abstract:
Few-Shot Action Recognition (FSAR) aims to classify new action categories accurately from only a limited number of labeled samples. Current methods struggle to capture spatiotemporal dynamics and to integrate multimodal information effectively. This work presents a new framework that improves FSAR performance by strengthening spatiotemporal modeling and integrating cross-modal semantics. To capture complex spatiotemporal relationships in videos, we introduce two complementary modules: Temporal Enhancement Adaptation (TEA), which enhances temporal modeling capability, and Spatio-Temporal Fusion Adaptation (STFA), which integrates spatial and temporal features for richer representations. Additionally, we propose the Text-Enhanced Prototype Module (TEPM), which strengthens prototype representations by fusing textual and visual features at multiple levels, improving the discriminability and generalization of prototypes. Experiments show that our approach achieves competitive performance on various benchmark datasets, confirming its effectiveness for FSAR.
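The abstract does not include implementation details, but the core idea behind a text-enhanced prototype (fusing class-name text embeddings with per-class visual prototypes before nearest-prototype classification) can be sketched as below. This is a minimal, single-level illustration under stated assumptions, not the authors' code: the mixing weight `alpha`, the feature dimensions, and the random stand-in features are all hypothetical, and the actual TEPM fuses textual and visual features at multiple levels.

```python
# Illustrative sketch only: the paper publishes no code, so the fusion rule,
# the mixing weight alpha, and all tensor shapes here are assumptions.
import torch
import torch.nn.functional as F


def text_enhanced_prototypes(support_feats, support_labels, text_feats,
                             n_classes, alpha=0.5):
    """Average support video features per class, then fuse each visual
    prototype with the class-name text embedding (single-level fusion).

    support_feats:  (n_support, d) pooled video features
    support_labels: (n_support,) class indices in [0, n_classes)
    text_feats:     (n_classes, d) text embeddings of class names
    alpha:          assumed visual/text mixing weight
    """
    d = support_feats.size(1)
    protos = torch.zeros(n_classes, d)
    for c in range(n_classes):
        visual_proto = support_feats[support_labels == c].mean(dim=0)
        protos[c] = alpha * visual_proto + (1 - alpha) * text_feats[c]
    return F.normalize(protos, dim=-1)


def classify_queries(query_feats, protos):
    """Nearest-prototype classification by cosine similarity."""
    query_feats = F.normalize(query_feats, dim=-1)
    return (query_feats @ protos.t()).argmax(dim=-1)


if __name__ == "__main__":
    # Toy 5-way 1-shot episode with random features standing in for real
    # video-encoder and text-encoder outputs (e.g., from a CLIP backbone).
    torch.manual_seed(0)
    d, n_way = 512, 5
    support = torch.randn(n_way, d)          # one support clip per class
    labels = torch.arange(n_way)
    text = torch.randn(n_way, d)             # class-name text embeddings
    queries = torch.randn(10, d)
    protos = text_enhanced_prototypes(support, labels, text, n_way)
    print(classify_queries(queries, protos))
```

In this reading, the text embedding acts as a prior that regularizes prototypes when only one or a few support clips are available, which is precisely the regime where purely visual prototypes are noisiest.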
Journal Introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.