Musrea Abdo Ghaseb, Ahmed Elhayek, Fawaz Alsolami, Abdullah Marish Ali
IEEE Transactions on Multimedia, vol. 27, pp. 3437-3446. Published 2025-01-27. DOI: 10.1109/TMM.2025.3535284
S3GAAR: Segmented Spatiotemporal Skeleton Graph-Attention for Action Recognition
Human motion recognition is extremely important for many practical applications across several disciplines, such as surveillance, medicine, sports, gait analysis, and computer graphics. Graph convolutional networks (GCNs) improve the accuracy and performance of skeleton-based action recognition. However, they have difficulty modeling long-term temporal dependencies. In addition, the fixed topology of the skeleton graph is not robust enough to extract features for skeleton motions. Although transformers that rely entirely on self-attention have shown great success in modeling global correlations between inputs and outputs, they ignore the local correlations between joints. In this study, we propose a novel segmented spatiotemporal skeleton graph-attention network (S3GAAR) that effectively learns different human actions and concentrates on the most relevant part of the human body for each action. S3GAAR models spatial-temporal features through spatiotemporal attention within each segment to capture short-term temporal dependencies. Because many human actions, such as mutual actions, involve one or more specific body parts, our method divides the human skeleton into three segments — superior, inferior, and extremity joints — and extracts the features of each segment individually. Moreover, our segmented spatiotemporal graph introduces additional edges between important distant joints in the same segment. The experimental results show that our method outperforms state-of-the-art methods by up to 1.1% on two large-scale benchmark datasets, NTU-RGB+D 60 and NTU-RGB+D 120.
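The per-segment graph construction described in the abstract — restricting the skeleton graph to one segment and adding extra edges between distant joints of that segment — can be sketched as follows. This is an illustrative reconstruction only: the helper name, the toy bone list, and the joint-to-segment assignment are assumptions, not the paper's actual implementation.

```python
import numpy as np

def segment_adjacency(num_joints, bones, segment, extra_edges=()):
    """Build a symmetric adjacency matrix restricted to one skeleton segment.

    bones       -- physical (parent, child) joint pairs of the full skeleton
    segment     -- joint indices belonging to this segment (e.g. superior)
    extra_edges -- additional links between important distant joints,
                   as the segmented spatiotemporal graph introduces
    Edges whose endpoints fall outside the segment are dropped.
    """
    seg = set(segment)
    A = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in list(bones) + list(extra_edges):
        if i in seg and j in seg:
            A[i, j] = A[j, i] = 1.0
    return A

# Toy 5-joint chain; joints 0-2 form one hypothetical segment.
bones = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = segment_adjacency(5, bones, segment=[0, 1, 2], extra_edges=[(0, 2)])
```

In this sketch, the bone (2, 3) is excluded because joint 3 lies outside the segment, while the extra edge (0, 2) links two non-adjacent joints inside it — mirroring the paper's idea of connecting important distant joints within the same segment.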
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.