MH-Net: Multiheaded 3D Hand Pose Estimation Network With 3D Anchorsets and Improved Multiscale Vision Transformer

IF 14 1区 工程技术 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Tekie Tsegay Tewolde;Ali Asghar Manjotho;Zhendong Niu
{"title":"MH-Net: Multiheaded 3D Hand Pose Estimation Network With 3D Anchorsets and Improved Multiscale Vision Transformer","authors":"Tekie Tsegay Tewolde;Ali Asghar Manjotho;Zhendong Niu","doi":"10.1109/TIV.2024.3387344","DOIUrl":null,"url":null,"abstract":"Accurate 3D hand pose estimation is a challenging computer vision problem primarily because of self-occlusion and viewpoint variations. Existing methods address viewpoint variations by applying data-centric transformations, such as data alignments or generating multiple views, which are prone to data sensitivity, error propagation, and prohibitive computational requirements. We improve the estimation accuracy by mitigating the impact of self-occlusion and viewpoint variations from the network side and propose MH-Net, a novel multiheaded network for accurate 3D hand pose estimation from a depth image. MH-Net comprises three key components. First, a multiscale feature extraction backbone based on an improved multiscale vision transformer (MViTv2) is proposed to extract shift-invariant global features. Second, a 3D anchorset generator is proposed to generate three disjoint sets of 3D anchors that serve two purposes: formulating hand pose estimation as an anchor-to-joint offset estimation and defining three unique viewpoints from a single depth image. Third, three identical regression heads are proposed to regress 3D joint positions based on unique viewpoints defined by their respective anchorsets. Extensive ablation studies have been conducted to investigate the impact of anchorsets, regression heads, and feature extraction backbones. Experiments on three public datasets, ICVL, MSRA, and NYU, show significant improvements over the state-of-the-art.","PeriodicalId":36532,"journal":{"name":"IEEE Transactions on Intelligent Vehicles","volume":"9 10","pages":"6660-6671"},"PeriodicalIF":14.0000,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Intelligent Vehicles","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10496825/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Accurate 3D hand pose estimation is a challenging computer vision problem primarily because of self-occlusion and viewpoint variations. Existing methods address viewpoint variations by applying data-centric transformations, such as data alignments or generating multiple views, which are prone to data sensitivity, error propagation, and prohibitive computational requirements. We improve the estimation accuracy by mitigating the impact of self-occlusion and viewpoint variations from the network side and propose MH-Net, a novel multiheaded network for accurate 3D hand pose estimation from a depth image. MH-Net comprises three key components. First, a multiscale feature extraction backbone based on an improved multiscale vision transformer (MViTv2) is proposed to extract shift-invariant global features. Second, a 3D anchorset generator is proposed to generate three disjoint sets of 3D anchors that serve two purposes: formulating hand pose estimation as an anchor-to-joint offset estimation and defining three unique viewpoints from a single depth image. Third, three identical regression heads are proposed to regress 3D joint positions based on unique viewpoints defined by their respective anchorsets. Extensive ablation studies have been conducted to investigate the impact of anchorsets, regression heads, and feature extraction backbones. Experiments on three public datasets, ICVL, MSRA, and NYU, show significant improvements over the state-of-the-art.
MH-Net:多头三维手部姿态估计网络与三维锚定和改进的多尺度视觉变压器
准确的三维手部姿态估计是一个具有挑战性的计算机视觉问题,主要是因为自遮挡和视点变化。现有方法通过应用以数据为中心的转换(例如数据对齐或生成多个视图)来处理视点变化,这些转换容易产生数据敏感性、错误传播和令人望而却步的计算需求。我们通过减轻网络侧自遮挡和视点变化的影响来提高估计精度,并提出了一种新的多头网络MH-Net,用于从深度图像中精确估计3D手部姿势。MH-Net由三个关键部分组成。首先,提出了一种基于改进的多尺度视觉转换器(MViTv2)的多尺度特征提取主干,提取平移不变性全局特征;其次,提出了一种三维锚点生成器,用于生成三个不相交的三维锚点集,这些锚点集有两个目的:将手部姿态估计表述为锚点到关节的偏移估计,并从单个深度图像中定义三个独特的视点。第三,提出三个相同的回归头,根据各自锚定集定义的独特视点回归三维关节位置。广泛的消融研究已经进行,以调查锚定器,回归头和特征提取骨干的影响。在三个公共数据集(ICVL、MSRA和NYU)上进行的实验表明,与最先进的技术相比,这种方法有了显著的改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
IEEE Transactions on Intelligent Vehicles
IEEE Transactions on Intelligent Vehicles Mathematics-Control and Optimization
CiteScore
12.10
自引率
13.40%
发文量
177
期刊介绍: The IEEE Transactions on Intelligent Vehicles (T-IV) is a premier platform for publishing peer-reviewed articles that present innovative research concepts, application results, significant theoretical findings, and application case studies in the field of intelligent vehicles. With a particular emphasis on automated vehicles within roadway environments, T-IV aims to raise awareness of pressing research and application challenges. Our focus is on providing critical information to the intelligent vehicle community, serving as a dissemination vehicle for IEEE ITS Society members and others interested in learning about the state-of-the-art developments and progress in research and applications related to intelligent vehicles. Join us in advancing knowledge and innovation in this dynamic field.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信