{"title":"3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors.","authors":"Changlong Jiang,Yang Xiao,Jinghong Zheng,Haohong Kuang,Cunlin Wu,Mingyang Zhang,Zhiguo Cao,Min Du,Joey Tianyi Zhou,Junsong Yuan","doi":"10.1109/tpami.2025.3609907","DOIUrl":null,"url":null,"abstract":"In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space be aware of hand's local fine details and global articulated context jointly, to facilitate predicting their 3D offsets toward hand joints with linear weighted aggregation for joint localization. Our intuition is that, local fine details help to estimate accurate offset but may suffer from the issues including serious occlusion, confusing similar patterns, and overfitting risk. On the other hand, hand's global articulated context can essentially provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a 2-stage manner. At the first stage, since the input modality property anchor points distribute more densely on X-Y plane, it leads to lower prediction accuracy along Z direction compared with those in the X and Y directions. To alleviate this, at the second stage anchor points are set near the joints yielded by the first stage evenly along X, Y, and Z directions. This treatment brings two main advantages: (1) balancing the prediction accuracy along X, Y, and Z directions, and (2) ensuring the anchor-joint offsets are of small values relatively easy to estimate. Wide-range experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2 and RHP) and three depth hand datasets (NYU, ICVL and HANDS 2017) verify A2J-Transformer+'s superiority and generalization ability for different modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based manners. The test on ITOP dataset reveals that, A2J-Transformer+ can also be applied to 3D human pose estimation task. The source code and supporting material will be released upon acceptance.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"84 1","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3609907","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in the form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space jointly aware of the hand's local fine details and global articulated context, to facilitate predicting their 3D offsets toward the hand joints, which are combined by linear weighted aggregation for joint localization. Our intuition is that local fine details help to estimate accurate offsets but may suffer from issues including serious occlusion, confusingly similar patterns, and overfitting risk. On the other hand, the hand's global articulated context can provide additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a two-stage manner. At the first stage, owing to the input modality's properties, the anchor points are distributed more densely on the X-Y plane, which leads to lower prediction accuracy along the Z direction than along the X and Y directions. To alleviate this, at the second stage anchor points are set near the joints yielded by the first stage, evenly along the X, Y, and Z directions. This treatment brings two main advantages: (1) it balances the prediction accuracy along the X, Y, and Z directions, and (2) it ensures that the anchor-joint offsets take small values that are relatively easy to estimate. Extensive experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2, and RHP) and three depth hand datasets (NYU, ICVL, and HANDS 2017) verify A2J-Transformer+'s superiority and generalization ability across different modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based methods. A test on the ITOP dataset shows that A2J-Transformer+ can also be applied to the 3D human pose estimation task. The source code and supporting material will be released upon acceptance.
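As a concrete illustration of the mechanism the abstract describes, the minimal NumPy sketch below shows how per-anchor 3D offsets can be combined by softmax-normalized linear weighted aggregation, and how second-stage anchors might be placed evenly along X, Y, and Z around first-stage joint estimates. The function names, tensor shapes, anchor counts, and grid parameters are hypothetical placeholders chosen for the example, not the authors' released implementation.

```python
import numpy as np

def aggregate_joints(anchors, offsets, weights):
    """Linearly aggregate per-anchor joint proposals.

    anchors: (A, 3) anchor positions in 3D space.
    offsets: (A, J, 3) predicted offset from each anchor to each joint.
    weights: (A, J) unnormalized informativeness of each anchor per joint.
    Returns (J, 3) estimated joint positions.
    """
    # Softmax over the anchor axis so each joint's weights sum to 1.
    w = np.exp(weights - weights.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)                        # (A, J)
    proposals = anchors[:, None, :] + offsets                # (A, J, 3)
    return (w[:, :, None] * proposals).sum(axis=0)           # (J, 3)

def second_stage_anchors(joints, radius=10.0, steps=3):
    """Place anchors evenly along X, Y, Z around first-stage joints.

    Hypothetical layout: a steps^3 lattice of half-width `radius`
    centered at each estimated joint, keeping anchor-joint offsets small.
    """
    lin = np.linspace(-radius, radius, steps)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)
    return (joints[:, None, :] + grid[None]).reshape(-1, 3)  # (J*steps^3, 3)

# Toy usage: 64 anchors, 21 hand joints; in practice the offsets and
# weights would come from the network's prediction heads.
rng = np.random.default_rng(0)
anchors = rng.uniform(-50, 50, (64, 3))
offsets = rng.normal(0, 5, (64, 21, 3))
weights = rng.normal(0, 1, (64, 21))
joints_stage1 = aggregate_joints(anchors, offsets, weights)
anchors_stage2 = second_stage_anchors(joints_stage1)
print(joints_stage1.shape, anchors_stage2.shape)  # (21, 3) (567, 3)
```

Because the second-stage anchors sit on a small, symmetric grid around each first-stage estimate, the residual offsets to regress are comparably scaled along all three axes, which is the balancing effect the paper attributes to its second stage.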
Journal Introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.