{"title":"Fusion of Temporal Transformer and Spatial Graph Convolutional Network for 3-D Skeleton-Parts-Based Human Motion Prediction","authors":"Mayank Lovanshi;Vivek Tiwari;Rajesh Ingle;Swati Jain","doi":"10.1109/THMS.2024.3452133","DOIUrl":null,"url":null,"abstract":"The field of human motion prediction has gained prominence, finding applications in various domains such as intelligent surveillance and human–robot interaction. However, predicting full-body human motion poses challenges in capturing joint interactions, handling diverse movement patterns, managing occlusions, and ensuring real-time performance. To address these challenges, the proposed model adopts a skeleton-parted strategy to dissect the skeleton structure, enhancing coordination and fusion between body parts. This novel method combines transformer-enabled graph convolutional networks for predicting human motion in 3-D skeleton data. It integrates a temporal transformer (T-Transformer) for comprehensive temporal feature extraction and a spatial graph convolutional network (S-GCN) for capturing spatial characteristics of human motion. The model's performance is evaluated on two comprehensive human motion datasets, Human3.6M and CMU motion capture (CMU Mocap), containing numerous videos encompassing short and long human motion sequences. Results indicate that the proposed model outperforms state-of-the-art methods on both datasets, significantly improving the average mean per joint positional error (avg-MPJPE) by 3.50% and 11.45% for short-term and long-term motion prediction, respectively. Similarly, on the CMU Mocap dataset, it achieves avg-MPJPE improvements of 2.69% and 1.05% for short-term and long-term motion prediction, respectively, demonstrating its superior accuracy in predicting human motion over extended periods. The study also investigates the impact of different numbers of T-Transformers and S-GCNs and explores the specific roles and contributions of the T-Transformer, S-GCN, and cross-part components.","PeriodicalId":48916,"journal":{"name":"IEEE Transactions on Human-Machine Systems","volume":"54 6","pages":"788-797"},"PeriodicalIF":3.5000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Human-Machine Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10678758/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The field of human motion prediction has gained prominence, finding applications in domains such as intelligent surveillance and human–robot interaction. However, predicting full-body human motion poses challenges in capturing joint interactions, handling diverse movement patterns, managing occlusions, and ensuring real-time performance. To address these challenges, the proposed model adopts a skeleton-parted strategy that dissects the skeleton structure, enhancing coordination and fusion between body parts. The method combines transformers with graph convolutional networks to predict human motion from 3-D skeleton data: a temporal transformer (T-Transformer) extracts comprehensive temporal features, while a spatial graph convolutional network (S-GCN) captures the spatial characteristics of human motion. The model's performance is evaluated on two comprehensive human motion datasets, Human3.6M and CMU motion capture (CMU Mocap), each containing numerous short and long human motion sequences. Results indicate that the proposed model outperforms state-of-the-art methods on both datasets. On Human3.6M, it improves the average mean per joint position error (avg-MPJPE) by 3.50% and 11.45% for short-term and long-term motion prediction, respectively; on CMU Mocap, it achieves avg-MPJPE improvements of 2.69% and 1.05% for short-term and long-term motion prediction, respectively, demonstrating superior accuracy in predicting human motion over extended periods. The study also investigates the impact of different numbers of T-Transformers and S-GCNs and explores the specific roles and contributions of the T-Transformer, S-GCN, and cross-part components.
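To make the two-branch design concrete, the following minimal PyTorch sketch pairs a temporal transformer (attending over frames, per joint) with a spatial GCN (aggregating over joints, per frame), and includes the standard MPJPE metric used in the evaluation. This is an illustration under stated assumptions, not the authors' implementation: the layer sizes, additive fusion, residual prediction head, and chain adjacency are placeholders, and the paper's skeleton-parted and cross-part components are omitted.

```python
# Illustrative sketch only (not the paper's code): temporal transformer +
# spatial GCN fusion for 3-D skeleton motion, plus the MPJPE metric.
import torch
import torch.nn as nn


def mpjpe(pred, target):
    """Mean per joint position error: mean Euclidean distance between
    predicted and ground-truth 3-D joint positions.
    Shapes: (batch, frames, joints, 3)."""
    return torch.norm(pred - target, dim=-1).mean()


class SpatialGCN(nn.Module):
    """One graph convolution over the skeleton graph: symmetrically
    normalised neighbourhood aggregation + per-joint linear projection."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        adj = adj + torch.eye(adj.size(0))          # add self-loops
        d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
        self.register_buffer(
            "adj", d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :])
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (B*T, joints, in_dim)
        return torch.relu(self.proj(self.adj @ x))


class TTransformerSGCN(nn.Module):
    """Temporal transformer over frames + spatial GCN over joints, fused
    by addition (an assumption) into a residual prediction head."""
    def __init__(self, n_joints, adj, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)          # lift 3-D coords per joint
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.t_transformer = nn.TransformerEncoder(layer, n_layers)
        self.s_gcn = SpatialGCN(d_model, d_model, adj)
        self.head = nn.Linear(d_model, 3)           # back to 3-D coordinates

    def forward(self, x):                           # x: (B, T, J, 3)
        b, t, j, _ = x.shape
        h = self.embed(x)                           # (B, T, J, D)
        # Temporal branch: each joint becomes a sequence of frames.
        ht = h.permute(0, 2, 1, 3).reshape(b * j, t, -1)
        ht = self.t_transformer(ht).reshape(b, j, t, -1).permute(0, 2, 1, 3)
        # Spatial branch: each frame becomes a graph over joints.
        hs = self.s_gcn(h.reshape(b * t, j, -1)).reshape(b, t, j, -1)
        return x + self.head(ht + hs)               # residual per-joint offsets


# Toy usage: 22-joint skeleton with a chain adjacency as a stand-in graph.
J = 22
adj = torch.zeros(J, J)
i = torch.arange(J - 1)
adj[i, i + 1] = adj[i + 1, i] = 1.0
model = TTransformerSGCN(J, adj)
pred = model(torch.randn(8, 10, J, 3))              # batch of 8, 10 frames
print(mpjpe(pred, torch.randn(8, 10, J, 3)))
```

The separation mirrors the abstract's description: attention operates along the time axis to model motion dynamics, while graph convolution operates along the joint axis to respect the skeleton's connectivity; how the two streams are actually fused in the paper is not specified in the abstract, so simple addition is used here.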
Journal Introduction:
The scope of the IEEE Transactions on Human-Machine Systems includes the field of human-machine systems. It covers human systems and human-organizational interactions, including cognitive ergonomics, system test and evaluation, and human information processing concerns in systems and organizations.