Grasping Hand Pose Estimation from RGB Images Using Digital Human Model by Convolutional Neural Network
Kentaro Ino, Naoto Ienaga, Yuta Sugiura, H. Saito, N. Miyata, M. Tada
Proceedings of 3DBODY.TECH 2018 - 9th International Conference and Exhibition on 3D Body Scanning and Processing Technologies, Lugano, Switzerland, 16-17 Oct. 2018
DOI: 10.15221/18.154
Citations: 2
Abstract
Recently, there has been an increase in research estimating hand poses from images. Because of the hand's high degrees of freedom and self-occlusion, multi-view or depth images are often used. Our objective was to estimate hand poses specifically while grasping objects. When holding something, the hand moves in many directions; if the camera is too distant from the hand, the hand may move out of range, but widening the viewing angle reduces the resolution beyond usable limits. One possible solution was developed by Kashiwagi: by mounting the camera on the object, the hand's pose can be estimated regardless of its position. However, Kashiwagi's method cannot be used without estimating the fingertips' positions. Recently, another method using a convolutional neural network (CNN), useful for estimating complex poses, has been proposed. Unfortunately, it is difficult to collect the large number of images with ground truth needed for training. In this research, we focused on creating a large dataset by generating hand pose images from a digital human model and motion-captured data using DhaibaWorks. We evaluated the model by calculating the distance between the estimated poses and the ground truth of the test data, which was approximately 12.3 mm on average; in comparison, the average distance in related work was 18.5 mm. We also tested our method with ordinary camera images and confirmed that it can be used in the real world. Our method provides a new means of dataset generation: annotations are produced automatically with motion capture technology, which reduces the time required. In future work, we will improve the architecture of the CNN and shorten the execution time for real-time processing.
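The evaluation described above is a mean Euclidean distance between estimated and ground-truth hand keypoints. The paper does not give implementation details, so the sketch below is only a minimal illustration of that kind of metric, assuming predictions and ground truth are arrays of 3D keypoint coordinates in millimeters; the function name, keypoint count, and synthetic data are illustrative, not from the paper.

import numpy as np

def mean_keypoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    keypoints, averaged over all keypoints and frames.

    pred, gt: arrays of shape (n_frames, n_keypoints, 3), in mm.
    """
    # Per-keypoint Euclidean distance in each frame.
    dists = np.linalg.norm(pred - gt, axis=-1)  # (n_frames, n_keypoints)
    return dists.mean()

# Example with random stand-ins for estimated and true hand poses
# (21 keypoints per hand is a common convention; assumed here).
rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 100.0, size=(1000, 21, 3))
pred = gt + rng.normal(0.0, 7.0, size=gt.shape)  # noisy estimates
print(f"mean error: {mean_keypoint_error(pred, gt):.1f} mm")

Averaging over both keypoints and frames yields a single millimeter-scale number comparable to the 12.3 mm and 18.5 mm figures quoted in the abstract.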