{"title":"用于人体姿态估计的双边姿态转换器","authors":"Chia-Chen Yen, Tao Pin, Hongmin Xu","doi":"10.1145/3532342.3532346","DOIUrl":null,"url":null,"abstract":"Human Pose is a well-defined fundamental task researched by the computer vision community for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in the human pose. Recently, Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods emphasize local information less and often focus on only a single scale feature that may not be suitable for dense image prediction tasks, which essentially requires multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework to handle the human pose. Specifically, BPT consists of an innovated bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and Spatial Branch (SB). The CB involves a VT-based backbone to capture the context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations containing rich spatial information to introduce the local spatial information that supplements the CB explicitly. About the decoder, a Mixed Feature Module consisting of local attention CNN is proposed to integrate the various-scale context and spatial features effectively. Experiments demonstrate that our approach achieves competitive performances in human pose estimation. 
Specifically, compared to the HRNet [1], the BPT saves 43% GFLOPs and drops only 0.1 points AP, achieving 75.7% AP with 9.0 GFLOPs, on the COCO keypoints dataset.","PeriodicalId":398859,"journal":{"name":"Proceedings of the 4th International Symposium on Signal Processing Systems","volume":"24 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bilateral Pose Transformer for Human Pose Estimation\",\"authors\":\"Chia-Chen Yen, Tao Pin, Hongmin Xu\",\"doi\":\"10.1145/3532342.3532346\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human Pose is a well-defined fundamental task researched by the computer vision community for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in the human pose. Recently, Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods emphasize local information less and often focus on only a single scale feature that may not be suitable for dense image prediction tasks, which essentially requires multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework to handle the human pose. Specifically, BPT consists of an innovated bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and Spatial Branch (SB). The CB involves a VT-based backbone to capture the context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations containing rich spatial information to introduce the local spatial information that supplements the CB explicitly. About the decoder, a Mixed Feature Module consisting of local attention CNN is proposed to integrate the various-scale context and spatial features effectively. 
Experiments demonstrate that our approach achieves competitive performances in human pose estimation. Specifically, compared to the HRNet [1], the BPT saves 43% GFLOPs and drops only 0.1 points AP, achieving 75.7% AP with 9.0 GFLOPs, on the COCO keypoints dataset.\",\"PeriodicalId\":398859,\"journal\":{\"name\":\"Proceedings of the 4th International Symposium on Signal Processing Systems\",\"volume\":\"24 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Symposium on Signal Processing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3532342.3532346\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Symposium on Signal Processing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3532342.3532346","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Bilateral Pose Transformer for Human Pose Estimation
Human pose estimation is a well-defined, fundamental task that the computer vision community has researched for years. Previous Convolutional Neural Network (CNN) based works have achieved significant success in human pose estimation. Recently, the Vision Transformer (VT) has shown superior performance on computer vision tasks. However, current VT methods place less emphasis on local information and often focus on a single-scale feature, which may not suit dense image prediction tasks that essentially require multi-scale representations. In this paper, we propose a novel Bilateral Pose Transformer (BPT) framework for human pose estimation. Specifically, BPT consists of an innovative bilateral branch encoder and a multi-scale integrating decoder. The bilateral branch encoder contains a Context Branch (CB) and a Spatial Branch (SB). The CB uses a VT-based backbone to capture context clues and produce multi-scale context features. The CNN-based SB maintains high-resolution representations with rich spatial information, explicitly supplementing the CB with local spatial detail. For the decoder, a Mixed Feature Module built from local-attention CNNs is proposed to integrate the various-scale context and spatial features effectively. Experiments demonstrate that our approach achieves competitive performance in human pose estimation. Specifically, compared to HRNet [1], BPT saves 43% of the GFLOPs while dropping only 0.1 points of AP, achieving 75.7% AP at 9.0 GFLOPs on the COCO keypoint dataset.
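To make the data flow of the bilateral design concrete, the following is a minimal NumPy sketch of the encoder/decoder wiring described above. It is not the authors' implementation: the transformer Context Branch is stood in for by average pooling at several scales, the CNN Spatial Branch by a simple 3x3 mean filter at full resolution, and the Mixed Feature Module by nearest-neighbour upsampling plus averaging. All function names here are hypothetical; only the overall structure (multi-scale context features fused with one high-resolution spatial feature) follows the paper's description.

```python
import numpy as np

def context_branch(x, scales=(2, 4)):
    """Hypothetical stand-in for the VT-based Context Branch (CB):
    emits multi-scale context features via average pooling."""
    feats = []
    for s in scales:
        h, w = x.shape[0] // s, x.shape[1] // s
        pooled = x[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))
        feats.append(pooled)
    return feats

def spatial_branch(x):
    """Hypothetical stand-in for the CNN-based Spatial Branch (SB):
    keeps a full-resolution map (here, a 3x3 mean filter)."""
    pad = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += pad[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / 9.0

def mixed_feature_decoder(context_feats, spatial_feat):
    """Hypothetical Mixed Feature Module: upsample each context scale
    back to full resolution (nearest-neighbour) and average it with
    the high-resolution spatial feature."""
    h, w = spatial_feat.shape
    fused = spatial_feat.copy()
    for f in context_feats:
        up = np.repeat(np.repeat(f, h // f.shape[0], axis=0),
                       w // f.shape[1], axis=1)
        fused += up[:h, :w]
    return fused / (1 + len(context_feats))

# Toy 8x8 "feature map" standing in for a backbone output.
x = np.arange(64, dtype=float).reshape(8, 8)
heatmap = mixed_feature_decoder(context_branch(x), spatial_branch(x))
```

The point of the sketch is the shape bookkeeping: the CB output lives at several coarse resolutions, the SB output stays at full resolution, and the decoder must bring everything back to a common grid before fusing, which is what the Mixed Feature Module does (with learned local attention rather than plain averaging) in the actual BPT.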