{"title":"Self-attention enhanced dynamic semantic multi-scale graph convolutional network for skeleton-based action recognition","authors":"Shihao Liu, Cheng Xu, Songyin Dai, Nuoya Li, Weiguo Pan, Bingxin Xu, Hongzhe Liu","doi":"10.1016/j.imavis.2025.105725","DOIUrl":null,"url":null,"abstract":"<div><div>Skeleton-based action recognition has attracted increasing attention due to its efficiency and robustness in modeling human motion. However, existing graph convolutional approaches often rely on predefined topologies and struggle to capture high-level semantic relations and long-range dependencies. Meanwhile, transformer-based methods, despite their effectiveness in modeling global dependencies, typically overlook local continuity and impose high computational costs. Moreover, current multi-stream fusion strategies commonly ignore low-level complementary cues across modalities. To address these limitations, we propose SAD-MSNet, a Self-Attention enhanced Multi-Scale dynamic semantic graph convolutional network. SAD-MSNet integrates a region-aware multi-scale skeleton simplification strategy to represent actions at different levels of abstraction. It employs a semantic-aware spatial modeling module that constructs dynamic graphs based on node types, edge types, and topological priors, further refined by channel-wise attention and adaptive fusion. For temporal modeling, the network utilizes a six-branch structure that combines standard causal convolution, dilated joint-guided temporal convolutions with varying dilation rates, and a global pooling branch, enabling it to effectively capture both short-term dynamics and long-range temporal semantics. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and N-UCLA demonstrate that SAD-MSNet achieves superior performance compared to state-of-the-art methods, while maintaining a compact and interpretable architecture.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"162 ","pages":"Article 105725"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003130","RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
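The abstract's temporal module combines causal convolution, dilated temporal convolutions with varying dilation rates, and a global pooling branch. The paper's own implementation is not given here, so the following is only an illustrative sketch under assumed conventions: a pure-Python dilated causal 1-D convolution (each output frame sees only frames t, t-d, t-2d, ... for dilation rate d) and a toy multi-branch fusion that averages branches with different dilation rates plus a global-average branch. All function names, kernels, and dilation rates are hypothetical choices for illustration.

```python
def dilated_causal_conv1d(x, weights, dilation=1):
    """Dilated causal 1-D convolution over a per-frame sequence.

    x: list of floats, one value per frame.
    weights: kernel taps; weights[k] is applied to frame t - k*dilation,
             so weights[0] multiplies the current frame.
    Frames before the sequence start are treated as zero (causal padding),
    so no output depends on future frames.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def multi_branch_temporal(x, kernel=(0.5, 0.3, 0.2), dilations=(1, 2, 4)):
    """Toy multi-branch fusion: dilated causal branches with growing
    receptive fields, plus a global-average-pooling branch broadcast
    back to every frame; outputs are averaged across branches."""
    branches = [dilated_causal_conv1d(x, kernel, d) for d in dilations]
    branches.append([sum(x) / len(x)] * len(x))  # global pooling branch
    return [sum(vals) / len(vals) for vals in zip(*branches)]
```

Stacking or paralleling branches with dilation rates 1, 2, 4, ... widens the temporal receptive field geometrically without increasing kernel size, which is the usual motivation for this design; the global branch injects sequence-level context at every frame.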
Journal introduction:
The primary aim of Image and Vision Computing is to provide an effective medium for exchanging the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.