{"title":"Fusion of Temporal Transformer and Spatial Graph Convolutional Network for 3-D Skeleton-Parts-Based Human Motion Prediction","authors":"Mayank Lovanshi;Vivek Tiwari;Rajesh Ingle;Swati Jain","doi":"10.1109/THMS.2024.3452133","DOIUrl":null,"url":null,"abstract":"The field of human motion prediction has gained prominence, finding applications in various domains such as intelligent surveillance and human–robot interaction. However, predicting full-body human motion poses challenges in capturing joint interactions, handling diverse movement patterns, managing occlusions, and ensuring real-time performance. To address these challenges, the proposed model adopts a skeleton-parted strategy to dissect the skeleton structure, enhancing coordination and fusion between body parts. This novel method combines transformer-enabled graph convolutional networks for predicting human motion in 3-D skeleton data. It integrates a temporal transformer (T-Transformer) for comprehensive temporal feature extraction and a spatial graph convolutional network (S-GCN) for capturing spatial characteristics of human motion. The model's performance is evaluated on two comprehensive human motion datasets, Human3.6M and CMU motion capture (CMU Mocap), containing numerous videos encompassing short and long human motion sequences. Results indicate that the proposed model outperforms state-of-the-art methods on both datasets, significantly improving the average mean per joint positional error (avg-MPJPE) by 3.50% and 11.45% for short-term and long-term motion prediction, respectively. Similarly, on the CMU Mocap dataset, it achieves avg-MPJPE improvements of 2.69% and 1.05% for short-term and long-term motion prediction, respectively, demonstrating its superior accuracy in predicting human motion over extended periods. The study also investigates the impact of different numbers of T-Transformers and S-GCNs and explores the specific roles and contributions of the T-Transformer, S-GCN, and cross-part components.","PeriodicalId":48916,"journal":{"name":"IEEE Transactions on Human-Machine Systems","volume":"54 6","pages":"788-797"},"PeriodicalIF":3.5000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Human-Machine Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10678758/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The field of human motion prediction has gained prominence, finding applications in domains such as intelligent surveillance and human–robot interaction. However, predicting full-body human motion poses challenges in capturing joint interactions, handling diverse movement patterns, managing occlusions, and ensuring real-time performance. To address these challenges, the proposed model adopts a skeleton-parted strategy that dissects the skeleton structure, enhancing coordination and fusion between body parts. The method combines transformers with graph convolutional networks to predict human motion from 3-D skeleton data: a temporal transformer (T-Transformer) extracts comprehensive temporal features, while a spatial graph convolutional network (S-GCN) captures the spatial characteristics of human motion. The model's performance is evaluated on two comprehensive human motion datasets, Human3.6M and CMU motion capture (CMU Mocap), each containing numerous short and long human motion sequences. Results indicate that the proposed model outperforms state-of-the-art methods on both datasets. On Human3.6M, it improves the average mean per joint position error (avg-MPJPE) by 3.50% and 11.45% for short-term and long-term motion prediction, respectively; on CMU Mocap, it achieves avg-MPJPE improvements of 2.69% and 1.05% for short-term and long-term motion prediction, respectively, demonstrating superior accuracy in predicting human motion over extended periods. The study also investigates the impact of different numbers of T-Transformers and S-GCNs and explores the specific roles and contributions of the T-Transformer, S-GCN, and cross-part components.
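To make the two-branch design concrete, the following minimal PyTorch sketch pairs a temporal transformer (attending over frames, per joint) with a spatial GCN (aggregating over joints, per frame), and includes the standard MPJPE metric used in the evaluation. This is an illustration under stated assumptions, not the authors' implementation: the layer sizes, additive fusion, residual prediction head, and chain adjacency are placeholders, and the paper's skeleton-parted and cross-part components are omitted.

```python
# Illustrative sketch only (not the paper's code): temporal transformer +
# spatial GCN fusion for 3-D skeleton motion, plus the MPJPE metric.
import torch
import torch.nn as nn


def mpjpe(pred, target):
    """Mean per joint position error: mean Euclidean distance between
    predicted and ground-truth 3-D joint positions.
    Shapes: (batch, frames, joints, 3)."""
    return torch.norm(pred - target, dim=-1).mean()


class SpatialGCN(nn.Module):
    """One graph convolution over the skeleton graph: symmetrically
    normalised neighbourhood aggregation + per-joint linear projection."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        adj = adj + torch.eye(adj.size(0))          # add self-loops
        d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
        self.register_buffer(
            "adj", d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :])
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                           # x: (B*T, joints, in_dim)
        return torch.relu(self.proj(self.adj @ x))


class TTransformerSGCN(nn.Module):
    """Temporal transformer over frames + spatial GCN over joints, fused
    by addition (an assumption) into a residual prediction head."""
    def __init__(self, n_joints, adj, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)          # lift 3-D coords per joint
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.t_transformer = nn.TransformerEncoder(layer, n_layers)
        self.s_gcn = SpatialGCN(d_model, d_model, adj)
        self.head = nn.Linear(d_model, 3)           # back to 3-D coordinates

    def forward(self, x):                           # x: (B, T, J, 3)
        b, t, j, _ = x.shape
        h = self.embed(x)                           # (B, T, J, D)
        # Temporal branch: each joint becomes a sequence of frames.
        ht = h.permute(0, 2, 1, 3).reshape(b * j, t, -1)
        ht = self.t_transformer(ht).reshape(b, j, t, -1).permute(0, 2, 1, 3)
        # Spatial branch: each frame becomes a graph over joints.
        hs = self.s_gcn(h.reshape(b * t, j, -1)).reshape(b, t, j, -1)
        return x + self.head(ht + hs)               # residual per-joint offsets


# Toy usage: 22-joint skeleton with a chain adjacency as a stand-in graph.
J = 22
adj = torch.zeros(J, J)
i = torch.arange(J - 1)
adj[i, i + 1] = adj[i + 1, i] = 1.0
model = TTransformerSGCN(J, adj)
pred = model(torch.randn(8, 10, J, 3))              # batch of 8, 10 frames
print(mpjpe(pred, torch.randn(8, 10, J, 3)))
```

The separation mirrors the abstract's description: attention operates along the time axis to model motion dynamics, while graph convolution operates along the joint axis to respect the skeleton's connectivity; how the two streams are actually fused in the paper is not specified in the abstract, so simple addition is used here.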
Journal Introduction:
The scope of the IEEE Transactions on Human-Machine Systems includes the field of human-machine systems. It covers human systems and human-organizational interactions, including cognitive ergonomics, system test and evaluation, and human information processing concerns in systems and organizations.