BCDPose: Diffusion-based 3D Human Pose Estimation with bone-chain prior knowledge

IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-07-08 DOI:10.1016/j.imavis.2025.105636

Xing Liu, Hao Tang

Image and Vision Computing, volume 162, Article 105636. https://www.sciencedirect.com/science/article/pii/S0262885625002240

Citations: 0

Abstract

Diffusion-based methods have recently emerged as the gold standard for 3D human pose estimation, largely thanks to their strong generative capabilities. Prior work has focused on building spatial and temporal denoisers from Transformer blocks. However, these Transformer-based denoisers often overlook the implicit structural and kinematic supervision available from prior knowledge of human biomechanics, namely the bone-chain structure of the body and joint kinematics, which could otherwise improve performance. We argue that the movement of each joint is intrinsically constrained by its neighboring joints within the bone chain and by the kinematic hierarchy. We therefore propose BCDPose, a Bone-Chain enhanced Diffusion method for 3D pose estimation. Within the denoiser, we introduce novel Transformer blocks enhanced with bone-chain prior knowledge to reconstruct contaminated 3D pose data. We also propose the Joint-DoF Hierarchical Temporal Embedding framework, which incorporates prior knowledge of joint kinematics. By integrating body hierarchy and temporal dependencies, this framework captures the complexity of human motion and enables accurate, robust pose estimation. Explicitly modeling joint kinematics in this way addresses a limitation of prior methods, which fail to capture the intrinsic dynamics of human motion. We conduct extensive experiments on various open benchmarks to evaluate BCDPose. The results show that BCDPose is highly competitive with other state-of-the-art models, underscoring its potential and practical applicability in 2D-to-3D human pose estimation. We plan to release our code publicly for further research.
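The abstract does not give implementation details, so the following is only an illustrative sketch of the two ingredients it names: a bone-chain structure over joints and a diffusion process that contaminates a 3D pose. Everything here is an assumption made for illustration and is not the authors' code: the 17-joint parent table (a Human3.6M-style ordering), the helper names `bone_chain_mask`, `chain_to_root`, and `add_noise`, and the use of the standard DDPM forward step. A mask like this could, for example, serve as an attention bias in a denoiser, but the paper's actual mechanism may differ.

```python
import math
import random

# Hypothetical 17-joint skeleton in a Human3.6M-style ordering (an assumption,
# not taken from the paper). PARENTS[j] is the parent of joint j; -1 marks the root.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def bone_chain_mask(parents):
    """Return an N x N 0/1 matrix: 1 where two joints are identical or share a bone."""
    n = len(parents)
    mask = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            mask[child][parent] = 1
            mask[parent][child] = 1
    return mask


def chain_to_root(joint, parents):
    """Walk from a joint up the kinematic hierarchy to the root joint."""
    chain = [joint]
    while parents[chain[-1]] >= 0:
        chain.append(parents[chain[-1]])
    return chain


def add_noise(pose, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [[signal * c + noise * rng.gauss(0.0, 1.0) for c in joint] for joint in pose]


if __name__ == "__main__":
    rng = random.Random(0)
    clean_pose = [[0.0, 0.0, 0.0] for _ in PARENTS]  # placeholder pose, 17 x 3
    noisy_pose = add_noise(clean_pose, alpha_bar=0.5, rng=rng)
    print(chain_to_root(16, PARENTS))  # joints from index 16 up to the root
    print(len(noisy_pose), len(noisy_pose[0]))
```

In this sketch the mask encodes only direct parent-child bones; longer chains, such as a whole limb, can be recovered with `chain_to_root`.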


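Source journal

Image and Vision Computing (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months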

Journal description: The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.