Two-hand Pose Estimation from the non-cropped RGB Image with Self-Attention Based Network
Authors: Zhoutao Sun, Yong Hu, Xukun Shen
Venue: 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
Published: 2021-10-01
DOI: https://doi.org/10.1109/ismar52148.2021.00040
Estimating the pose of two hands is a crucial problem for many human-computer interaction applications. Because most existing methods predict hand pose from cropped images, they either require a hand detection stage before pose estimation or take cropped images directly as input. In this paper, we propose the first real-time one-stage method for two-hand pose estimation from a single non-cropped RGB image without hand tracking. By combining the self-attention mechanism with convolutional layers, the proposed network predicts 2.5D hand joint coordinates while locating the two hand regions. To reduce the extra memory and computation introduced by self-attention, we propose a linear attention structure built on a spatial reduction attention block, which we call the SRAN block. An ablation study demonstrates the effectiveness of each component of the network, and experiments on public datasets show results competitive with state-of-the-art methods.
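The abstract does not spell out the internals of the SRAN block, so the following is only a minimal sketch of the general idea behind spatial-reduction attention, not the authors' implementation. It assumes a single attention head, no learned projections, and average pooling as a stand-in for whatever reduction the paper actually uses; the function names and the reduction factor `r` are hypothetical. Queries stay at full resolution while keys and values are pooled, so the number of query-key scores drops from N^2 to roughly N^2 / r^2 for N = H * W positions.

```python
import math


def avg_pool_tokens(grid, r):
    """Average-pool an H x W grid of feature vectors by a factor r per side.

    Assumption: average pooling stands in for the paper's reduction step.
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h, r):
        for j in range(0, w, r):
            cell = [grid[y][x]
                    for y in range(i, min(i + r, h))
                    for x in range(j, min(j + r, w))]
            pooled.append([sum(v[k] for v in cell) / len(cell) for k in range(c)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_reduction_attention(grid, r):
    """Single-head attention with full-resolution queries and pooled keys/values.

    Returns the attended tokens and the number of query-key scores computed.
    """
    c = len(grid[0][0])
    queries = [vec for row in grid for vec in row]  # N = H * W tokens
    keys = values = avg_pool_tokens(grid, r)        # about N / r^2 tokens
    scale = 1.0 / math.sqrt(c)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(wt * v[d] for wt, v in zip(weights, values))
                    for d in range(c)])
    return out, len(queries) * len(keys)


if __name__ == "__main__":
    # Hypothetical 8 x 8 feature map with 2 channels.
    grid = [[[float(y), float(x)] for x in range(8)] for y in range(8)]
    _, full_cost = spatial_reduction_attention(grid, 1)     # 64 * 64 scores
    _, reduced_cost = spatial_reduction_attention(grid, 4)  # 64 * 4 scores
    print(full_cost, reduced_cost)
```

With `r = 1` the function reduces to ordinary dense self-attention over all positions, which makes it easy to compare the two costs on the same feature map.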