BCDPose: Diffusion-based 3D Human Pose Estimation with bone-chain prior knowledge
Authors: Xing Liu, Hao Tang
Journal: Image and Vision Computing, Volume 162, Article 105636
Publication date: 2025-07-08
Publication type: Journal Article
DOI: 10.1016/j.imavis.2025.105636
URL: https://www.sciencedirect.com/science/article/pii/S0262885625002240
Impact factor: 4.2 (JCR Q2, Computer Science, Artificial Intelligence)
Open access: No
Citations: 0
Abstract
Diffusion-based methods have recently emerged as the gold standard for 3D human pose estimation, largely thanks to their strong generative capabilities. Prior work has concentrated on building spatial and temporal denoisers from transformer blocks. However, existing transformer-based denoisers in diffusion models often overlook the implicit structural and kinematic supervision available from prior knowledge of human biomechanics, namely the bone-chain structure of the body and joint kinematics, which could otherwise improve performance. We hold that joint movements are intrinsically constrained by neighboring joints within the bone chain and by kinematic hierarchies. We therefore propose BCDPose, a Bone-Chain enhanced Diffusion method for 3D pose estimation. BCDPose introduces novel transformer blocks, enhanced with bone-chain prior knowledge, within the denoiser to reconstruct contaminated (noise-corrupted) 3D pose data. We additionally propose the Joint-DoF Hierarchical Temporal Embedding framework, which incorporates prior knowledge of joint kinematics. By integrating body hierarchy and temporal dependencies, this framework captures the complexity of human motion and enables accurate and robust pose estimation. Explicitly modeling joint kinematics in this way addresses a limitation of prior methods, which fail to capture the intrinsic dynamics of human motion. We conduct extensive experiments on several open benchmarks to evaluate BCDPose. The results show that BCDPose is highly competitive with other state-of-the-art models, underscoring its potential and practical applicability to 2D-to-3D human pose estimation tasks. We plan to release our code publicly for further research.
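The abstract does not say how the bone-chain and kinematic-hierarchy priors are encoded, so the following is only an illustrative sketch of the kind of structural information such priors could draw on, not the authors' implementation. It assumes a 17-joint skeleton in the common Human3.6M-style ordering (the `PARENTS` array is an assumption, as are the helper names `hop_distances`, `hierarchy_depth`, and `chain_mask`). It computes the bone-chain distance between every pair of joints and each joint's depth in the kinematic tree.

```python
from collections import deque

# Hypothetical 17-joint skeleton (Human3.6M-style ordering). PARENTS[j] is the
# parent of joint j; -1 marks the root (pelvis). Assumed, not taken from the paper.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def bone_adjacency(parents):
    """Undirected neighbor lists: two joints are adjacent if a bone links them."""
    neighbors = [[] for _ in parents]
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbors[child].append(parent)
            neighbors[parent].append(child)
    return neighbors


def hop_distances(parents):
    """All-pairs bone-chain distance (number of bones) via BFS from each joint."""
    neighbors = bone_adjacency(parents)
    n = len(parents)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src][v] < 0:  # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def hierarchy_depth(parents):
    """Depth of each joint in the kinematic tree (the root has depth 0)."""
    depths = []
    for j in range(len(parents)):
        d, p = 0, parents[j]
        while p >= 0:
            d, p = d + 1, parents[p]
        depths.append(d)
    return depths


def chain_mask(parents, max_hops=2):
    """Boolean mask: True where joint j lies within max_hops bones of joint i."""
    return [[d <= max_hops for d in row] for row in hop_distances(parents)]
```

In a transformer denoiser, a matrix like `chain_mask(PARENTS)` could restrict or bias spatial attention toward joints on the same bone chain, and `hierarchy_depth(PARENTS)` could index a learned per-level embedding. Both uses are assumptions about one plausible design, not a description of BCDPose.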
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.