Beyond-Skeleton: Zero-shot Skeleton Action Recognition enhanced by supplementary RGB visual information

IF 7.5 | CAS Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications, Volume 273, Article 126814. Pub Date: 2025-02-15. DOI: 10.1016/j.eswa.2025.126814

Hongjie Liu, Yingchun Niu, Kun Zeng, Chun Liu, Mengjie Hu, Qing Song

https://www.sciencedirect.com/science/article/pii/S0957417425004361

Citations: 0

Abstract

Zero-shot action recognition (ZSAR) recognizes action categories that do not appear during training, and it has attracted widespread attention because it can reduce the cost of retraining and data annotation. We observe that existing skeleton-based ZSAR methods use only the human posture information contained in skeleton sequences, lack discriminative semantic representations for distinguishing similar behaviors, and lack effective interaction between modalities. These shortcomings lead to unsatisfactory performance and limit the applications of ZSAR. To address these problems, we propose a novel method, Beyond-Skeleton zero-shot Learning (BSZSL), to enhance zero-shot skeleton action recognition. First, we introduce a multi-prompt learning strategy that uses prompt information to guide the model to learn complementary, category-related semantic information from both skeleton sequences and RGB data simultaneously, making the visual features more expressive. Specifically, a pre-trained multimodal model extracts behavior-related prior knowledge from RGB data, and this knowledge then guides the skeleton sequence features, strengthening the complementary features of the RGB and skeleton modalities. Second, to constrain the mapping between features from different modalities, we design a Contrastive Clustering (CC) module, which emphasizes the similarity of features within the same category while increasing the separation between the feature mappings of different categories. Finally, experiments on the NTU-60 and NTU-120 datasets under multiple split settings demonstrate that our method achieves state-of-the-art performance in both the zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings.
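The abstract describes the Contrastive Clustering (CC) module only by its objective: pull features of the same category together and push features of different categories apart. It does not give a loss formula, so the following is a minimal sketch of one common way to express that objective, a supervised contrastive (SupCon-style) loss over L2-normalized features, in plain Python. The function names, the temperature of 0.1, and the choice of a SupCon-style loss are all assumptions for illustration, not the paper's formulation.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def contrastive_clustering_loss(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Hypothetical SupCon-style stand-in for the CC objective (an assumption).

    For each anchor i, samples sharing its label are positives (pulled
    together); every other sample j != i appears in the softmax
    denominator, so samples with different labels are pushed apart.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without any positive contribute nothing
        # Temperature-scaled cosine similarities between anchor i and every sample.
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in range(n)]
        others = [logits[j] for j in range(n) if j != i]
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(others)
        log_denom = m + math.log(sum(math.exp(s - m) for s in others))
        # Average negative log-probability assigned to the anchor's positives.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

When the features already cluster by label, the loss is close to zero; reassigning labels so that each anchor's positives are distant features makes it large, which is the behavior the CC module is described as encouraging.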


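Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months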

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, and offers practical guidelines. It spans sectors such as finance, engineering, marketing, law, project management, information management, and medicine. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, but excludes applications to military/defense systems.