{"title":"STELA: Spatial–temporal enhanced learning with an anatomical graph transformer for 3D human pose estimation","authors":"Jian Son, Jiho Lee, Eunwoo Kim","doi":"10.1016/j.cviu.2025.104381","DOIUrl":null,"url":null,"abstract":"<div><div>Transformers have led to remarkable performance improvements in 3D human pose estimation by capturing global dependencies between joints in spatial and temporal aspects. To leverage human body topology information, attempts have been made to incorporate graph representation within a transformer architecture. However, they neglect spatial–temporal anatomical knowledge inherent in the human body, without considering the implicit relationships of non-connected joints. Furthermore, they disregard the movement patterns between joint trajectories, concentrating on the trajectories of individual joints. In this paper, we propose Spatial–Temporal Enhanced Learning with an Anatomical graph transformer (STELA) to aggregate the spatial–temporal global relationships and intricate anatomical relationships between joints. It consists of Global Self-attention (GS) and Anatomical Graph-attention (AG) branches. GS learns long-range dependencies between all joints across entire frames. AG focuses on the anatomical relationships of the human body in the spatial–temporal aspect using skeleton and motion pattern graphs. Extensive experiments demonstrate that STELA outperforms state-of-the-art approaches with an average of 41% fewer parameters, reducing MPJPE by an average of 2.7 mm on Human3.6M and 1.5 mm on MPI-INF-3DHP.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104381"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001043","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Transformers have led to remarkable performance improvements in 3D human pose estimation by capturing global dependencies between joints in both the spatial and temporal dimensions. To leverage human body topology, attempts have been made to incorporate graph representations within a transformer architecture. However, these approaches neglect the spatial–temporal anatomical knowledge inherent in the human body and do not consider the implicit relationships between non-connected joints. Furthermore, they concentrate on the trajectories of individual joints while disregarding the movement patterns between those trajectories. In this paper, we propose Spatial–Temporal Enhanced Learning with an Anatomical graph transformer (STELA) to aggregate both the spatial–temporal global relationships and the intricate anatomical relationships between joints. STELA consists of a Global Self-attention (GS) branch and an Anatomical Graph-attention (AG) branch. GS learns long-range dependencies between all joints across all frames. AG focuses on the anatomical relationships of the human body in both space and time using skeleton and motion-pattern graphs. Extensive experiments demonstrate that STELA outperforms state-of-the-art approaches with an average of 41% fewer parameters, reducing MPJPE by an average of 2.7 mm on Human3.6M and 1.5 mm on MPI-INF-3DHP.
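To make the dual-branch design concrete, the following is a minimal PyTorch sketch of one block combining global self-attention over all joint tokens with graph attention restricted to skeleton connections. It is not the authors' implementation: the class and variable names (DualBranchBlock, skeleton_adj) are illustrative, the concatenate-and-project fusion is an assumption, and the motion-pattern graph and cross-frame attention described in the abstract are omitted for brevity.

```python
# Sketch of a dual-branch attention block in the spirit of STELA.
# Assumption: the AG branch is modeled as self-attention masked by the
# skeleton adjacency; the real AG branch also uses a motion-pattern graph.
import torch
import torch.nn as nn


class DualBranchBlock(nn.Module):
    """One block: global self-attention (GS) over all joint tokens, plus
    graph attention (AG) limited to anatomically connected joints."""

    def __init__(self, dim, num_heads, skeleton_adj):
        super().__init__()
        self.gs = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ag = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Boolean attention mask: True = blocked. Allow attention only along
        # skeleton edges plus self-loops; skeleton_adj is a (J, J) 0/1 matrix.
        J = skeleton_adj.shape[0]
        allowed = skeleton_adj.bool() | torch.eye(J, dtype=torch.bool)
        self.register_buffer("ag_mask", ~allowed)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)  # merge the two branch outputs

    def forward(self, x):
        # x: (B*T, J, dim) -- per-frame joint tokens. The paper's GS branch
        # spans entire frames; a per-frame version keeps this sketch short.
        h = self.norm1(x)
        gs_out, _ = self.gs(h, h, h)
        ag_out, _ = self.ag(h, h, h, attn_mask=self.ag_mask)
        fused = self.fuse(torch.cat([gs_out, ag_out], dim=-1))
        return x + self.norm2(fused)


# Usage with a toy skeleton: 17 joints (Human3.6M convention), a few bones.
J, dim = 17, 64
adj = torch.zeros(J, J)
for a, b in [(0, 1), (1, 2), (2, 3)]:  # example parent-child joint pairs
    adj[a, b] = adj[b, a] = 1.0
block = DualBranchBlock(dim, num_heads=4, skeleton_adj=adj)
tokens = torch.randn(2 * 9, J, dim)  # 2 clips x 9 frames of joint tokens
out = block(tokens)                  # (18, 17, 64)
```

The self-loops in the mask matter: without them, a joint with no listed bone would have every attention position blocked, which yields NaNs in softmax. Fusing by concatenation is one simple choice; a learned gate or summation between the GS and AG outputs would serve the same purpose.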
Journal Description:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems