{"title":"Semantic information guided multimodal skeleton-based action recognition","authors":"Chenghao Li , Wenlong Liang , Fei Yin , Yahui Zhao , Zhenguo Zhang","doi":"10.1016/j.inffus.2025.103289","DOIUrl":null,"url":null,"abstract":"<div><div>Human skeleton sequences are a crucial data modality for human motion representation. The primary challenge in skeleton-based action recognition lies in the effective capture of spatio-temporal correlations among skeleton joints. However, when the human body interacts with other objects in the background, these spatio-temporal correlations may become less apparent. To tackle this issue, we analyze the semantic information of human actions and propose a Semantic Information Guided Human Skeleton Action Recognition method (ActionGCL), which facilitates the differentiation of skeleton data from different action categories within a latent space. Concretely, we first construct a spatio-temporal action encoder based on graph convolutional neural networks to extract the dependencies among human skeleton sequences. It comprises alternating stacks of modules designed for temporal feature extraction and spatial graph convolution. The temporal feature extraction module integrates multiscale temporal convolutional networks to capture rich inter-frame correlations among nodes, while the spatial graph convolution module adaptively learns a sample-specific topology graph. Subsequently, to leverage the rich semantic information embedded within action labels, we design a multimodal contrastive learning module that simultaneously utilizes both skeleton and textual data. This module optimizes skeleton data in both skeleton-textual directions, employing the abundant semantic information within action labels to guide the training of spatio-temporal action encoders. It facilitates the accurate identification of ambiguous actions that are difficult to discern based solely on spatio-temporal correlations. Experimental results on two prominent action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that ActionGCL is effective and significantly outperforms other models in recognition accuracy.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"123 ","pages":"Article 103289"},"PeriodicalIF":14.7000,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525003628","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Human skeleton sequences are a crucial data modality for representing human motion. The primary challenge in skeleton-based action recognition lies in effectively capturing the spatio-temporal correlations among skeleton joints. However, when the human body interacts with other objects in the scene, these spatio-temporal correlations may become less apparent. To tackle this issue, we analyze the semantic information of human actions and propose a Semantic Information Guided Human Skeleton Action Recognition method (ActionGCL), which separates skeleton data from different action categories within a latent space. Concretely, we first construct a spatio-temporal action encoder based on graph convolutional networks to extract the dependencies within human skeleton sequences. It comprises alternating stacks of modules for temporal feature extraction and spatial graph convolution: the temporal feature extraction module integrates multiscale temporal convolutions to capture rich inter-frame correlations among joints, while the spatial graph convolution module adaptively learns a sample-specific topology graph. Subsequently, to leverage the rich semantic information embedded in action labels, we design a multimodal contrastive learning module that uses skeleton and textual data simultaneously. This module aligns the two modalities in both the skeleton-to-text and text-to-skeleton directions, employing the semantic information in the action labels to guide the training of the spatio-temporal action encoder. It enables accurate identification of ambiguous actions that are difficult to discern from spatio-temporal correlations alone. Experimental results on two prominent action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that ActionGCL is effective and significantly outperforms other models in recognition accuracy.
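As a rough illustration of the encoder structure described above, the following is a minimal PyTorch sketch of one spatial-temporal block: an adaptive graph convolution that refines a shared topology with a sample-specific term, followed by a multiscale temporal convolution. The module names, kernel sizes, and the specific form of the adaptive adjacency are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution with a shared learnable topology plus a
    sample-specific refinement inferred from the input (assumed formulation)."""

    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.base_adj = nn.Parameter(torch.eye(num_joints))        # shared topology
        self.theta = nn.Conv2d(in_channels, out_channels // 4, 1)  # joint embeddings
        self.phi = nn.Conv2d(in_channels, out_channels // 4, 1)
        self.proj = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):                        # x: (N, C, T, V)
        q = self.theta(x).mean(dim=2)            # (N, C', V), pooled over time
        k = self.phi(x).mean(dim=2)
        # sample-specific adjacency added to the shared one
        adj = self.base_adj + torch.softmax(torch.einsum('ncv,ncw->nvw', q, k), dim=-1)
        x = self.proj(x)                         # (N, C_out, T, V)
        return torch.einsum('nctv,nvw->nctw', x, adj)


class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different kernel sizes, fused by a 1x1 conv."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):                        # x: (N, C, T, V)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class EncoderBlock(nn.Module):
    """One spatial-then-temporal block; the full encoder stacks several of these."""

    def __init__(self, in_channels, out_channels, num_joints=25):
        super().__init__()
        self.gcn = AdaptiveGraphConv(in_channels, out_channels, num_joints)
        self.tcn = MultiScaleTemporalConv(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (N, C, T, V) skeleton sequence
        return self.relu(self.tcn(self.gcn(x)))
```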
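Likewise, here is a hedged sketch of the bidirectional skeleton-text contrastive objective, in the spirit of CLIP-style alignment: skeleton features are pulled toward the text embedding of their action label and vice versa. The symmetric InfoNCE form, the temperature value, and the identity-target assumption (one positive per row, ignoring label repeats within a batch) are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def skeleton_text_contrastive_loss(skel_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning skeleton embeddings with the text embeddings
    of their action labels. Both inputs have shape (batch, dim); row i of each
    tensor is assumed to describe the same sample."""
    skel = F.normalize(skel_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = skel @ text.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)      # skeleton -> text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)  # text -> skeleton direction
    return 0.5 * (loss_s2t + loss_t2s)
```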
About the Journal
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.