MANet: Motion-Aware Network for Video Action Recognition

IF 4.6; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Complex & Intelligent Systems, Pub Date: 2025-02-06, DOI: 10.1007/s40747-024-01774-9

Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Chen Zhang

Citations: 0

Abstract

Video action recognition is a fundamental task in video understanding. Actions in videos may vary at different speeds or scales, and it is difficult to cope with a wide variety of actions by relying on a single spatio-temporal scale to extract features. To address this problem, we propose a Motion-Aware Network (MANet), which includes three key modules: (1) Local Motion Encoding Module (LMEM) for capturing local motion features, (2) Spatio-Temporal Excitation Module (STEM) for extracting multi-granular motion information, and (3) Multiple Temporal Aggregation Module (MTAM) for modeling multi-scale temporal information. The MANet, equipped with these modules, can capture multi-granularity spatio-temporal cues. We conducted extensive experiments on five mainstream datasets, Something-Something V1 & V2, Jester, Diving48, and UCF-101, to validate the effectiveness of MANet. The MANet achieves competitive performance on Something-Something V1 (52.5%), Something-Something V2 (63.6%), Jester (95.9%), Diving48 (81.8%) and UCF-101 (86.2%). In addition, we visualize the feature representation of the MANet using Grad-CAM to validate its effectiveness.
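The abstract does not describe the internals of LMEM, STEM, or MTAM, so the following is only a rough illustration of two general ideas it names: local motion cues and multi-scale temporal information. It assumes each frame has already been reduced to a small feature vector, uses adjacent-frame differences as a stand-in for local motion, and pools over several temporal window sizes as a stand-in for multi-scale aggregation. All function names here are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence


def frame_differences(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Adjacent-frame feature differences, a common proxy for local motion.

    Assumption: this is a generic technique, not the paper's LMEM.
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(features, features[1:])
    ]


def temporal_average_pool(
    features: Sequence[Sequence[float]], window: int
) -> List[List[float]]:
    """Average per-frame features over sliding windows of `window` frames (stride 1)."""
    if window < 1 or window > len(features):
        raise ValueError("window must be between 1 and the number of frames")
    dim = len(features[0])
    pooled = []
    for start in range(len(features) - window + 1):
        chunk = features[start:start + window]
        pooled.append([sum(frame[d] for frame in chunk) / window for d in range(dim)])
    return pooled


def multi_scale_summary(
    features: Sequence[Sequence[float]], windows: Sequence[int] = (1, 2, 4)
) -> List[float]:
    """Concatenate one clip-level descriptor per temporal window size.

    For each window size, features are average-pooled over time and then
    reduced with a per-channel maximum. Assumption: a generic multi-scale
    scheme for illustration, not the paper's MTAM.
    """
    descriptor: List[float] = []
    for window in windows:
        pooled = temporal_average_pool(features, window)
        dim = len(pooled[0])
        descriptor.extend(max(row[d] for row in pooled) for d in range(dim))
    return descriptor


# Four frames, two feature channels each (made-up numbers for illustration).
clip = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0], [6.0, 2.0]]
motion = frame_differences(clip)
summary = multi_scale_summary(clip, windows=(1, 2, 4))
```

Short windows keep fast changes visible, while long windows smooth them out, which is one simple way to see why a single temporal scale may not suit actions of different speeds.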

Source journal

Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 9.60

Self-citation rate: 10.30%

Articles published: 297

About the journal: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.