{"title":"Semantic information guided multimodal skeleton-based action recognition","authors":"Chenghao Li , Wenlong Liang , Fei Yin , Yahui Zhao , Zhenguo Zhang","doi":"10.1016/j.inffus.2025.103289","DOIUrl":null,"url":null,"abstract":"<div><div>Human skeleton sequences are a crucial data modality for human motion representation. The primary challenge in skeleton-based action recognition lies in the effective capture of spatio-temporal correlations among skeleton joints. However, when the human body interacts with other objects in the background, these spatio-temporal correlations may become less apparent. To tackle this issue, we analyze the semantic information of human actions and propose a Semantic Information Guided Human Skeleton Action Recognition method (ActionGCL), which facilitates the differentiation of skeleton data from different action categories within a latent space. Concretely, we first construct a spatio-temporal action encoder based on graph convolutional neural networks to extract the dependencies among human skeleton sequences. It comprises alternating stacks of modules designed for temporal feature extraction and spatial graph convolution. The temporal feature extraction module integrates multiscale temporal convolutional networks to capture rich inter-frame correlations among nodes, while the spatial graph convolution module adaptively learns a sample-specific topology graph. Subsequently, to leverage the rich semantic information embedded within action labels, we design a multimodal contrastive learning module that simultaneously utilizes both skeleton and textual data. This module optimizes skeleton data in both skeleton-textual directions, employing the abundant semantic information within action labels to guide the training of spatio-temporal action encoders. It facilitates the accurate identification of ambiguous actions that are difficult to discern based solely on spatio-temporal correlations. Experimental results on two prominent action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that ActionGCL is effective and significantly outperforms other models in recognition accuracy.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103289"},"PeriodicalIF":14.7000,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525003628","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Human skeleton sequences are a crucial data modality for representing human motion. The primary challenge in skeleton-based action recognition lies in effectively capturing the spatio-temporal correlations among skeleton joints. However, when the human body interacts with other objects in the scene, these spatio-temporal correlations may become less apparent. To tackle this issue, we analyze the semantic information of human actions and propose a Semantic Information Guided Human Skeleton Action Recognition method (ActionGCL), which separates skeleton data from different action categories within a latent space. Concretely, we first construct a spatio-temporal action encoder based on graph convolutional networks to extract the dependencies within human skeleton sequences. It comprises alternating stacks of modules for temporal feature extraction and spatial graph convolution: the temporal feature extraction module integrates multiscale temporal convolutions to capture rich inter-frame correlations among joints, while the spatial graph convolution module adaptively learns a sample-specific topology graph. Subsequently, to leverage the rich semantic information embedded in action labels, we design a multimodal contrastive learning module that uses skeleton and textual data simultaneously. This module aligns the two modalities in both the skeleton-to-text and text-to-skeleton directions, employing the semantic information in the action labels to guide the training of the spatio-temporal action encoder. It enables accurate identification of ambiguous actions that are difficult to discern from spatio-temporal correlations alone. Experimental results on two prominent action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that ActionGCL is effective and significantly outperforms other models in recognition accuracy.
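As a rough illustration of the encoder structure described above, the following is a minimal PyTorch sketch of one spatial-temporal block: an adaptive graph convolution that refines a shared topology with a sample-specific term, followed by a multiscale temporal convolution. The module names, kernel sizes, and the specific form of the adaptive adjacency are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution with a shared learnable topology plus a
    sample-specific refinement inferred from the input (assumed formulation)."""

    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.base_adj = nn.Parameter(torch.eye(num_joints))        # shared topology
        self.theta = nn.Conv2d(in_channels, out_channels // 4, 1)  # joint embeddings
        self.phi = nn.Conv2d(in_channels, out_channels // 4, 1)
        self.proj = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):                        # x: (N, C, T, V)
        q = self.theta(x).mean(dim=2)            # (N, C', V), pooled over time
        k = self.phi(x).mean(dim=2)
        # sample-specific adjacency added to the shared one
        adj = self.base_adj + torch.softmax(torch.einsum('ncv,ncw->nvw', q, k), dim=-1)
        x = self.proj(x)                         # (N, C_out, T, V)
        return torch.einsum('nctv,nvw->nctw', x, adj)


class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different kernel sizes, fused by a 1x1 conv."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):                        # x: (N, C, T, V)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class EncoderBlock(nn.Module):
    """One spatial-then-temporal block; the full encoder stacks several of these."""

    def __init__(self, in_channels, out_channels, num_joints=25):
        super().__init__()
        self.gcn = AdaptiveGraphConv(in_channels, out_channels, num_joints)
        self.tcn = MultiScaleTemporalConv(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (N, C, T, V) skeleton sequence
        return self.relu(self.tcn(self.gcn(x)))
```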
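Likewise, here is a hedged sketch of the bidirectional skeleton-text contrastive objective, in the spirit of CLIP-style alignment: skeleton features are pulled toward the text embedding of their action label and vice versa. The symmetric InfoNCE form, the temperature value, and the identity-target assumption (one positive per row, ignoring label repeats within a batch) are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def skeleton_text_contrastive_loss(skel_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning skeleton embeddings with the text embeddings
    of their action labels. Both inputs have shape (batch, dim); row i of each
    tensor is assumed to describe the same sample."""
    skel = F.normalize(skel_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = skel @ text.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)      # skeleton -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)  # text -> skeleton direction
    return 0.5 * (loss_s2t + loss_t2s)
```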
About the Journal
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.