MH-Net: Multiheaded 3D Hand Pose Estimation Network With 3D Anchorsets and Improved Multiscale Vision Transformer

IF 14 1区工程技术 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Intelligent Vehicles Pub Date : 2024-04-11 DOI:10.1109/TIV.2024.3387344

Tekie Tsegay Tewolde;Ali Asghar Manjotho;Zhendong Niu

{"title":"MH-Net: Multiheaded 3D Hand Pose Estimation Network With 3D Anchorsets and Improved Multiscale Vision Transformer","authors":"Tekie Tsegay Tewolde;Ali Asghar Manjotho;Zhendong Niu","doi":"10.1109/TIV.2024.3387344","DOIUrl":null,"url":null,"abstract":"Accurate 3D hand pose estimation is a challenging computer vision problem primarily because of self-occlusion and viewpoint variations. Existing methods address viewpoint variations by applying data-centric transformations, such as data alignments or generating multiple views, which are prone to data sensitivity, error propagation, and prohibitive computational requirements. We improve the estimation accuracy by mitigating the impact of self-occlusion and viewpoint variations from the network side and propose MH-Net, a novel multiheaded network for accurate 3D hand pose estimation from a depth image. MH-Net comprises three key components. First, a multiscale feature extraction backbone based on an improved multiscale vision transformer (MViTv2) is proposed to extract shift-invariant global features. Second, a 3D anchorset generator is proposed to generate three disjoint sets of 3D anchors that serve two purposes: formulating hand pose estimation as an anchor-to-joint offset estimation and defining three unique viewpoints from a single depth image. Third, three identical regression heads are proposed to regress 3D joint positions based on unique viewpoints defined by their respective anchorsets. Extensive ablation studies have been conducted to investigate the impact of anchorsets, regression heads, and feature extraction backbones. Experiments on three public datasets, ICVL, MSRA, and NYU, show significant improvements over the state-of-the-art.","PeriodicalId":36532,"journal":{"name":"IEEE Transactions on Intelligent Vehicles","volume":"9 10","pages":"6660-6671"},"PeriodicalIF":14.0000,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Intelligent Vehicles","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10496825/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Accurate 3D hand pose estimation is a challenging computer vision problem primarily because of self-occlusion and viewpoint variations. Existing methods address viewpoint variations by applying data-centric transformations, such as data alignments or generating multiple views, which are prone to data sensitivity, error propagation, and prohibitive computational requirements. We improve the estimation accuracy by mitigating the impact of self-occlusion and viewpoint variations from the network side and propose MH-Net, a novel multiheaded network for accurate 3D hand pose estimation from a depth image. MH-Net comprises three key components. First, a multiscale feature extraction backbone based on an improved multiscale vision transformer (MViTv2) is proposed to extract shift-invariant global features. Second, a 3D anchorset generator is proposed to generate three disjoint sets of 3D anchors that serve two purposes: formulating hand pose estimation as an anchor-to-joint offset estimation and defining three unique viewpoints from a single depth image. Third, three identical regression heads are proposed to regress 3D joint positions based on unique viewpoints defined by their respective anchorsets. Extensive ablation studies have been conducted to investigate the impact of anchorsets, regression heads, and feature extraction backbones. Experiments on three public datasets, ICVL, MSRA, and NYU, show significant improvements over the state-of-the-art.

查看原文本刊更多论文

MH-Net：多头三维手部姿态估计网络与三维锚定和改进的多尺度视觉变压器

准确的三维手部姿态估计是一个具有挑战性的计算机视觉问题，主要是因为自遮挡和视点变化。现有方法通过应用以数据为中心的转换（例如数据对齐或生成多个视图）来处理视点变化，这些转换容易产生数据敏感性、错误传播和令人望而却步的计算需求。我们通过减轻网络侧自遮挡和视点变化的影响来提高估计精度，并提出了一种新的多头网络MH-Net，用于从深度图像中精确估计3D手部姿势。MH-Net由三个关键部分组成。首先，提出了一种基于改进的多尺度视觉转换器（MViTv2）的多尺度特征提取主干，提取平移不变性全局特征；其次，提出了一种三维锚点生成器，用于生成三个不相交的三维锚点集，这些锚点集有两个目的：将手部姿态估计表述为锚点到关节的偏移估计，并从单个深度图像中定义三个独特的视点。第三，提出三个相同的回归头，根据各自锚定集定义的独特视点回归三维关节位置。广泛的消融研究已经进行，以调查锚定器，回归头和特征提取骨干的影响。在三个公共数据集（ICVL、MSRA和NYU）上进行的实验表明，与最先进的技术相比，这种方法有了显著的改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Intelligent Vehicles Mathematics-Control and Optimization

CiteScore

12.10

自引率

13.40%

发文量

177

期刊介绍： The IEEE Transactions on Intelligent Vehicles (T-IV) is a premier platform for publishing peer-reviewed articles that present innovative research concepts, application results, significant theoretical findings, and application case studies in the field of intelligent vehicles. With a particular emphasis on automated vehicles within roadway environments, T-IV aims to raise awareness of pressing research and application challenges. Our focus is on providing critical information to the intelligent vehicle community, serving as a dissemination vehicle for IEEE ITS Society members and others interested in learning about the state-of-the-art developments and progress in research and applications related to intelligent vehicles. Join us in advancing knowledge and innovation in this dynamic field.