S. Nguyen, Thi-Thu-Hong Le, Hoang-Bach Nguyen, Thanh-Tung Phan, Chi-Thanh Nguyen, Hai Vu
{"title":"基于HOPE-Net和Mask R-CNN的自中心视觉手部姿态估计改进","authors":"S. Nguyen, Thi-Thu-Hong Le, Hoang-Bach Nguyen, Thanh-Tung Phan, Chi-Thanh Nguyen, Hai Vu","doi":"10.1109/MAPR56351.2022.9924768","DOIUrl":null,"url":null,"abstract":"Hand pose estimation is the task of predicting the position and orientation of the hand and fingers relative to some coordinate system. It is an important task or input for applications in robotics, medical or human-computer interaction. In recent years, the success of deep convolutional neural networks and the popularity of low-cost consumer wearable cameras have made hand pose estimation on egocentric images using deep neural networks a hot topic in the computer vision field. This paper proposes a novel deep model for accurate 2D hand pose estimation that combines HOPE-Net, which estimates hand pose, and Mask R-CNN, which provides hand detection and segmentation to localize the hand in the image. First, HOPENet is used to predict the initial 2D hand pose, and the hand features are extracted from an image with a hand in the center, which is cropped from the original image based on Mask RCNN’s output. Then, we combine the initial 2D hand pose and the hand features into a fully connected layer to predict the 2D hand pose correctly. Our experiments show that the proposed model outperforms the original HOPE-Net in 2D hand pose estimation. The proposed method’s mean endpoint error (mEPE) is 48.82 pixels, while the mEPE of the 2D HOPE-Net predictor is 86.30 pixels on the First-Person Hand Action dataset.","PeriodicalId":138642,"journal":{"name":"2022 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving the Hand Pose Estimation from Egocentric Vision via HOPE-Net and Mask R-CNN\",\"authors\":\"S. Nguyen, Thi-Thu-Hong Le, Hoang-Bach Nguyen, Thanh-Tung Phan, Chi-Thanh Nguyen, Hai Vu\",\"doi\":\"10.1109/MAPR56351.2022.9924768\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hand pose estimation is the task of predicting the position and orientation of the hand and fingers relative to some coordinate system. It is an important task or input for applications in robotics, medical or human-computer interaction. In recent years, the success of deep convolutional neural networks and the popularity of low-cost consumer wearable cameras have made hand pose estimation on egocentric images using deep neural networks a hot topic in the computer vision field. This paper proposes a novel deep model for accurate 2D hand pose estimation that combines HOPE-Net, which estimates hand pose, and Mask R-CNN, which provides hand detection and segmentation to localize the hand in the image. First, HOPENet is used to predict the initial 2D hand pose, and the hand features are extracted from an image with a hand in the center, which is cropped from the original image based on Mask RCNN’s output. Then, we combine the initial 2D hand pose and the hand features into a fully connected layer to predict the 2D hand pose correctly. Our experiments show that the proposed model outperforms the original HOPE-Net in 2D hand pose estimation. 
The proposed method’s mean endpoint error (mEPE) is 48.82 pixels, while the mEPE of the 2D HOPE-Net predictor is 86.30 pixels on the First-Person Hand Action dataset.\",\"PeriodicalId\":138642,\"journal\":{\"name\":\"2022 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MAPR56351.2022.9924768\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MAPR56351.2022.9924768","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving the Hand Pose Estimation from Egocentric Vision via HOPE-Net and Mask R-CNN
Hand pose estimation is the task of predicting the position and orientation of the hand and fingers relative to some coordinate system. It is an important input for applications in robotics, medicine, and human-computer interaction. In recent years, the success of deep convolutional neural networks and the popularity of low-cost consumer wearable cameras have made hand pose estimation on egocentric images using deep neural networks a hot topic in the computer vision field. This paper proposes a novel deep model for accurate 2D hand pose estimation that combines HOPE-Net, which estimates the hand pose, and Mask R-CNN, which provides hand detection and segmentation to localize the hand in the image. First, HOPE-Net is used to predict the initial 2D hand pose, and hand features are extracted from a hand-centered image that is cropped from the original frame based on Mask R-CNN's output. Then, we combine the initial 2D hand pose and the hand features in a fully connected layer to produce a refined 2D hand pose prediction. Our experiments show that the proposed model outperforms the original HOPE-Net in 2D hand pose estimation. The proposed method's mean endpoint error (mEPE) is 48.82 pixels, while the mEPE of the 2D HOPE-Net predictor is 86.30 pixels on the First-Person Hand Action dataset.
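For illustration, here is a minimal PyTorch-style sketch of the two-stage idea the abstract describes: a detector localizes the hand, a hand-centered crop is extracted, and a fully connected head fuses an initial 2D pose with features from the crop to refine it. The names RefinedHandPose2D, hand_crop, and mean_endpoint_error, the ResNet-18 feature extractor, the 21-joint layout, and all layer sizes are assumptions for the sketch, not the authors' implementation; HOPE-Net itself is treated as an external black box that supplies the initial 2D joints.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms.functional as TF

NUM_JOINTS = 21  # common 21-joint hand skeleton (wrist + 4 joints per finger)


def hand_crop(image, detector, size=224):
    """Crop the frame around the highest-scoring detection.

    In the paper this localization is done by Mask R-CNN; here a stock
    torchvision Mask R-CNN can serve as a stand-in (see usage below).
    `image` is a float tensor of shape (3, H, W) with values in [0, 1].
    """
    detector.eval()
    with torch.no_grad():
        det = detector([image])[0]
    if len(det["boxes"]) == 0:
        return TF.resize(image, [size, size])  # fallback: no hand found
    x1, y1, x2, y2 = det["boxes"][0].round().int().tolist()
    return TF.resize(image[:, y1:y2, x1:x2], [size, size])


class RefinedHandPose2D(nn.Module):
    """Fuse an initial 2D pose (HOPE-Net's role) with features extracted
    from the hand-centered crop via a fully connected refinement head."""

    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="DEFAULT")
        self.features = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 512, 1, 1)
        self.fuse = nn.Sequential(
            nn.Linear(NUM_JOINTS * 2 + feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, NUM_JOINTS * 2),
        )

    def forward(self, crop, initial_pose_2d):
        # crop: (B, 3, 224, 224) hand-centered crops
        # initial_pose_2d: (B, 21, 2) joints predicted by a HOPE-Net-style model
        feats = self.features(crop).flatten(1)                     # (B, 512)
        fused = torch.cat([initial_pose_2d.flatten(1), feats], 1)  # (B, 42 + 512)
        return self.fuse(fused).view(-1, NUM_JOINTS, 2)            # refined 2D joints


def mean_endpoint_error(pred, gt):
    """mEPE: mean Euclidean distance in pixels between predicted and
    ground-truth 2D joints, the metric reported in the abstract."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()


# Example wiring (illustrative only):
# detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
# crop = hand_crop(frame, detector).unsqueeze(0)   # (1, 3, 224, 224)
# refined = RefinedHandPose2D()(crop, initial_pose_from_hope_net)
```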