Hierarchical Aggregated Graph Neural Network for Skeleton-Based Action Recognition
Pei Geng; Xuequan Lu; Wanqing Li; Lei Lyu
IEEE Transactions on Multimedia, vol. 26, pp. 11003-11017
DOI: 10.1109/TMM.2024.3428330
Published: 2024-07-15 (Journal Article) · Impact Factor: 8.4 · JCR: Q1 (Computer Science, Information Systems) · Region 1 (Computer Science)
IEEE Xplore: https://ieeexplore.ieee.org/document/10598383/
Citations: 0
Abstract
Supervised human action recognition methods based on skeleton data have achieved impressive performance recently. However, many current works emphasize the design of different contrastive strategies to gain stronger supervision signals, ignoring the crucial role of the model's encoder in encoding fine-grained action representations. Our key insight is that a superior skeleton encoder can effectively exploit the fine-grained dependencies among different kinds of skeleton information (e.g., joint, bone, angle) to mine more discriminative fine-grained features. In this paper, we devise an innovative hierarchical aggregated graph neural network (HA-GNN) that involves several core components. In particular, the proposed hierarchical graph convolution (HGC) module learns the complementary semantic information among joints, bones, and angles in a hierarchical manner. The designed pyramid attention fusion mechanism (PAFM) fuses the skeleton features successively to complement the action representations obtained by the HGC. The multi-scale temporal convolution (MSTC) module enriches the expressive capability of temporal features. In addition, to learn more comprehensive semantic representations of the skeleton, we construct a multi-task learning framework with simple contrastive learning and design a learnable data-enhancement strategy to acquire different data representations. Extensive experiments on the NTU RGB+D 60/120, NW-UCLA, Kinetics-400, UAV-Human, and PKU-MMD datasets demonstrate that the proposed HA-GNN achieves state-of-the-art performance in skeleton-based action recognition without contrastive learning, and even better results with it.
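The HGC module consumes joint, bone, and angle streams. The abstract does not spell out how these streams are built; as one illustration of the common convention in skeleton-based recognition (a sketch with a hypothetical 5-joint chain whose parent indices are my own choosing, not NTU RGB+D's), bones can be derived as joint-to-parent differences and angles as cosines between adjacent bones:

```python
import numpy as np

# Hypothetical 5-joint chain: PARENTS[v] is the parent index of joint v
# (joint 0 is the root, its own parent). Illustrative only.
PARENTS = [0, 0, 1, 2, 3]

def bone_stream(joints):
    """joints: (T, V, 3) sequence of 3D joint coordinates.
    Bone vector of joint v = position of v minus position of its parent."""
    return joints - joints[:, PARENTS, :]

def angle_stream(joints, eps=1e-8):
    """Cosine of the angle between each bone and its parent bone,
    one common way to build an angle stream."""
    bones = bone_stream(joints)
    parent_bones = bones[:, PARENTS, :]
    num = (bones * parent_bones).sum(-1)
    den = np.linalg.norm(bones, axis=-1) * np.linalg.norm(parent_bones, axis=-1)
    return num / (den + eps)

T, V = 4, 5
joints = np.random.default_rng(0).normal(size=(T, V, 3))
print(bone_stream(joints).shape)   # (4, 5, 3)
print(angle_stream(joints).shape)  # (4, 5)
```

The root joint's bone vector is zero by construction, which mirrors how bone modalities are usually padded in practice.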
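The MSTC module enriches temporal features at multiple scales. As a hedged illustration (not the paper's implementation), one standard realization runs parallel depthwise temporal convolutions at different dilation rates and concatenates the branch outputs along the channel axis; the fixed averaging kernel below stands in for learned weights:

```python
import numpy as np

def temporal_conv(x, kernel, dilation):
    """Depthwise temporal convolution with zero 'same' padding.
    x: (T, C) frames-by-channels; kernel: (K,) shared across channels,
    a deliberate simplification of learned per-channel weights."""
    T = x.shape[0]
    K = len(kernel)
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for k in range(K):
        out += kernel[k] * xp[k * dilation : k * dilation + T]
    return out

def mstc(x, dilations=(1, 2, 3)):
    """Multi-scale temporal block: one branch per dilation rate,
    outputs concatenated along the channel axis."""
    kernel = np.ones(3) / 3.0  # fixed averaging kernel as a stand-in
    return np.concatenate(
        [temporal_conv(x, kernel, d) for d in dilations], axis=1)

x = np.random.default_rng(1).normal(size=(8, 4))  # 8 frames, 4 channels
print(mstc(x).shape)  # (8, 12): 3 branches x 4 channels
```

Larger dilations widen the temporal receptive field without adding parameters, which is the usual motivation for multi-scale temporal branches.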
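The multi-task framework pairs the encoder with "simple contrastive learning" over different data representations. One standard instantiation (assumed here, not taken from the paper) is an InfoNCE-style loss that pulls matching augmented views together while pushing other samples apart:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Minimal InfoNCE-style loss between two views.
    z1, z2: (N, D) embeddings; row i of z1 and row i of z2 form the
    positive pair, and all other rows of z2 act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()               # cross-entropy on the diagonal

rng = np.random.default_rng(2)
z1 = rng.normal(size=(6, 8))
z2 = z1 + 0.05 * rng.normal(size=(6, 8))  # two slightly perturbed "views"
print(float(info_nce(z1, z2)))
```

The learnable data-enhancement strategy described in the abstract would supply the two views; any fixed augmentation (cropping, rotation, temporal masking) could be swapped in for this sketch.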
About the Journal
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.