{"title":"用于人体姿态估计的双边姿态转换器","authors":"Chia-Chen Yen, Tao Pin, Hongmin Xu","doi":"10.1145/3532342.3532346","DOIUrl":null,"url":null,"abstract":"Human Pose is a well-defined fundamental task researched by the computer vision community for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in the human pose. Recently, Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods emphasize local information less and often focus on only a single scale feature that may not be suitable for dense image prediction tasks, which essentially requires multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework to handle the human pose. Specifically, BPT consists of an innovated bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and Spatial Branch (SB). The CB involves a VT-based backbone to capture the context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations containing rich spatial information to introduce the local spatial information that supplements the CB explicitly. About the decoder, a Mixed Feature Module consisting of local attention CNN is proposed to integrate the various-scale context and spatial features effectively. Experiments demonstrate that our approach achieves competitive performances in human pose estimation. 
Specifically, compared to the HRNet [1], the BPT saves 43% GFLOPs and drops only 0.1 points AP, achieving 75.7% AP with 9.0 GFLOPs, on the COCO keypoints dataset.","PeriodicalId":398859,"journal":{"name":"Proceedings of the 4th International Symposium on Signal Processing Systems","volume":"24 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bilateral Pose Transformer for Human Pose Estimation\",\"authors\":\"Chia-Chen Yen, Tao Pin, Hongmin Xu\",\"doi\":\"10.1145/3532342.3532346\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human Pose is a well-defined fundamental task researched by the computer vision community for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in the human pose. Recently, Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods emphasize local information less and often focus on only a single scale feature that may not be suitable for dense image prediction tasks, which essentially requires multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework to handle the human pose. Specifically, BPT consists of an innovated bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and Spatial Branch (SB). The CB involves a VT-based backbone to capture the context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations containing rich spatial information to introduce the local spatial information that supplements the CB explicitly. About the decoder, a Mixed Feature Module consisting of local attention CNN is proposed to integrate the various-scale context and spatial features effectively. 
Experiments demonstrate that our approach achieves competitive performances in human pose estimation. Specifically, compared to the HRNet [1], the BPT saves 43% GFLOPs and drops only 0.1 points AP, achieving 75.7% AP with 9.0 GFLOPs, on the COCO keypoints dataset.\",\"PeriodicalId\":398859,\"journal\":{\"name\":\"Proceedings of the 4th International Symposium on Signal Processing Systems\",\"volume\":\"24 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Symposium on Signal Processing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3532342.3532346\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Symposium on Signal Processing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3532342.3532346","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bilateral Pose Transformer for Human Pose Estimation
Human pose estimation is a well-defined, fundamental task that the computer vision community has researched for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in human pose estimation. Recently, the Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods place less emphasis on local information and often focus on a single-scale feature, which may not suit dense image prediction tasks that essentially require multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework for human pose estimation. Specifically, BPT consists of an innovative bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and a Spatial Branch (SB). The CB uses a VT-based backbone to capture context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations with rich spatial information, explicitly supplementing the CB with local spatial detail. For the decoder, a Mixed Feature Module built from local-attention CNNs is proposed to integrate the various-scale context and spatial features effectively. Experiments demonstrate that our approach achieves competitive performance in human pose estimation. Specifically, compared to HRNet [1], BPT saves 43% of the GFLOPs while dropping only 0.1 points of AP, achieving 75.7% AP at 9.0 GFLOPs on the COCO keypoint dataset.
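To make the data flow of the bilateral design concrete, the following is a minimal NumPy sketch of the encoder/decoder wiring described above. It is not the authors' implementation: the transformer Context Branch is stood in for by average pooling at several scales, the CNN Spatial Branch by a simple 3x3 mean filter at full resolution, and the Mixed Feature Module by nearest-neighbour upsampling plus averaging. All function names here are hypothetical; only the overall structure (multi-scale context features fused with one high-resolution spatial feature) follows the paper's description.

```python
import numpy as np

def context_branch(x, scales=(2, 4)):
    """Hypothetical stand-in for the VT-based Context Branch (CB):
    emits multi-scale context features via average pooling."""
    feats = []
    for s in scales:
        h, w = x.shape[0] // s, x.shape[1] // s
        pooled = x[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))
        feats.append(pooled)
    return feats

def spatial_branch(x):
    """Hypothetical stand-in for the CNN-based Spatial Branch (SB):
    keeps a full-resolution map (here, a 3x3 mean filter)."""
    pad = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += pad[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / 9.0

def mixed_feature_decoder(context_feats, spatial_feat):
    """Hypothetical Mixed Feature Module: upsample each context scale
    back to full resolution (nearest-neighbour) and average it with
    the high-resolution spatial feature."""
    h, w = spatial_feat.shape
    fused = spatial_feat.copy()
    for f in context_feats:
        up = np.repeat(np.repeat(f, h // f.shape[0], axis=0),
                       w // f.shape[1], axis=1)
        fused += up[:h, :w]
    return fused / (1 + len(context_feats))

# Toy 8x8 "feature map" standing in for a backbone output.
x = np.arange(64, dtype=float).reshape(8, 8)
heatmap = mixed_feature_decoder(context_branch(x), spatial_branch(x))
```

The point of the sketch is the shape bookkeeping: the CB output lives at several coarse resolutions, the SB output stays at full resolution, and the decoder must bring everything back to a common grid before fusing, which is what the Mixed Feature Module does (with learned local attention rather than plain averaging) in the actual BPT.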