End-to-end weakly-supervised single-stage multiple 3D hand mesh reconstruction from a single RGB image

Comput. Vis. Image Underst. Pub Date: 2022-04-18 DOI: 10.2139/ssrn.4199294

Jinwei Ren, Jianke Zhu, Jialiang Zhang


Citations: 2

Abstract

In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve the problem in multiple stages: the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose, for the first time, a concise yet efficient single-stage pipeline for multi-hand reconstruction. Specifically, we design a multi-head auto-encoder structure in which the head networks share the same feature map and output the hand center, pose, and texture, respectively. In addition, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D annotation of real-world data. To this end, we propose a series of losses optimized by a stage-wise training scheme, in which a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly-supervised model, we adopt several feature consistency constraints in both single-hand and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks, including FreiHAND, HO3D, InterHand2.6M, and RHD, demonstrate that our method outperforms state-of-the-art model-based methods in both the weakly-supervised and fully-supervised settings. The code and models are available at https://github.com/zijinxuxu/SMHR.
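The consistency constraint described in the abstract (keypoints estimated from local features should agree with points re-projected from the global prediction) can be illustrated with a minimal NumPy sketch. The abstract does not state the camera model or the distance measure, so the weak-perspective projection, the mean L2 distance, and the function names `project_weak_perspective` and `consistency_loss` are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def project_weak_perspective(joints_3d, scale, trans_2d):
    """Project (J, 3) joints to (J, 2) image points.

    Assumed camera model (not taken from the paper): drop the depth
    coordinate, then apply a scalar scale and a 2D translation.
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    trans_2d = np.asarray(trans_2d, dtype=float)
    return scale * joints_3d[:, :2] + trans_2d


def consistency_loss(local_kpts_2d, joints_3d, scale, trans_2d):
    """Mean L2 distance between local keypoints and re-projected joints.

    local_kpts_2d: (J, 2) keypoints estimated from local features.
    joints_3d:     (J, 3) joints predicted from global features.
    The choice of a mean L2 distance is an assumption.
    """
    local_kpts_2d = np.asarray(local_kpts_2d, dtype=float)
    reprojected = project_weak_perspective(joints_3d, scale, trans_2d)
    per_joint = np.linalg.norm(local_kpts_2d - reprojected, axis=1)
    return float(per_joint.mean())
```

Under these assumptions the loss is zero when the two sets of 2D points coincide and grows with their average pixel distance, so it needs only 2D information at training time, which is in keeping with the weakly-supervised setting the abstract describes.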



