Grasping Hand Pose Estimation from RGB Images Using Digital Human Model by Convolutional Neural Network
Kentaro Ino, Naoto Ienaga, Yuta Sugiura, H. Saito, N. Miyata, M. Tada
Proceedings of 3DBODY.TECH 2018 - 9th International Conference and Exhibition on 3D Body Scanning and Processing Technologies, Lugano, Switzerland, 16-17 Oct. 2018
DOI: 10.15221/18.154
Citations: 2
Abstract
Recently, there has been an increase in research estimating hand poses from images. Because of the hand's high degrees of freedom and self-occlusion, multi-view or depth images are often used. Our objective was to estimate hand poses specifically while grasping objects. When holding something, the hand moves in many directions; if the camera is too distant from the hand, the hand may move out of range, but widening the viewing angle reduces the resolution beyond usable limits. One possible solution was developed by Kashiwagi: by mounting the camera on the object, the hand's pose can be estimated regardless of its position. However, Kashiwagi's method cannot be used without estimating the fingertips' positions. Recently, another method using a convolutional neural network (CNN), useful for estimating complex poses, has been proposed. Unfortunately, it is difficult to collect the large number of images with ground truth needed for training. In this research, we focused on creating a large dataset by generating hand pose images from a digital human model and motion-captured data using DhaibaWorks. We evaluated the model by calculating the distance between the estimated poses and the ground truth of the test data, which was approximately 12.3 mm on average; in comparison, the average distance in related work was 18.5 mm. We also tested our method with ordinary camera images and confirmed that it can be used in the real world. Our method provides a new means of dataset generation: annotations are produced automatically with motion capture technology, which reduces the time required. In future work, we will improve the architecture of the CNN and shorten the execution time for real-time processing.
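The evaluation described above is a mean Euclidean distance between estimated and ground-truth hand keypoints. The paper does not give implementation details, so the sketch below is only a minimal illustration of that kind of metric, assuming predictions and ground truth are arrays of 3D keypoint coordinates in millimeters; the function name, keypoint count, and synthetic data are illustrative, not from the paper.

import numpy as np

def mean_keypoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    keypoints, averaged over all keypoints and frames.

    pred, gt: arrays of shape (n_frames, n_keypoints, 3), in mm.
    """
    # Per-keypoint Euclidean distance in each frame.
    dists = np.linalg.norm(pred - gt, axis=-1)  # (n_frames, n_keypoints)
    return dists.mean()

# Example with random stand-ins for estimated and true hand poses
# (21 keypoints per hand is a common convention; assumed here).
rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 100.0, size=(1000, 21, 3))
pred = gt + rng.normal(0.0, 7.0, size=gt.shape)  # noisy estimates
print(f"mean error: {mean_keypoint_error(pred, gt):.1f} mm")

Averaging over both keypoints and frames yields a single millimeter-scale number comparable to the 12.3 mm and 18.5 mm figures quoted in the abstract.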