Study on human pose estimation based on channel and spatial attention
Author: Yilong Liu
Venue: 2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE)
Publication date: 2023-01-06
DOI: https://doi.org/10.1109/ICCECE58074.2023.10135500
Citations: 0
Abstract
Study on human pose estimation based on channel and spatial attention
Accurate pose estimation is crucial for understanding human behavior in images and videos. Given an RGB image, the goal is to accurately locate important keypoints on the body. Understanding human pose and body structure matters for high-level tasks such as human-computer interaction. Human pose estimation commonly suffers from weak discrimination between the human body and the background, and pose estimation based on HRnet does not make full use of important feature information. To address these problems, this paper proposes MCSA-HRnet (Multi-scale Channel and Spatial Attention), a human pose estimation method that improves HRnet with a channel attention mechanism and a spatial attention mechanism. Working in both the channel domain and the spatial domain, MCSA-HRnet integrates multi-level attention into the high-resolution network structure through a channel attention (CA) block and a spatial attention (SA) block, so that the network focuses on image regions strongly associated with the human body rather than on other regions. The core of the CA block uses 1×1 convolutions for information extraction, and the SA block uses parallel 3×3 and 5×5 convolutions. Parallel convolutions of different sizes produce spatial attention maps at different scales, which strengthens the network's ability to distinguish human features from background features and so helps locate the human body region and its keypoints more accurately. The method is evaluated on the COCO keypoint dataset, and the results show that MCSA-HRnet improves the accuracy of joint localization in human pose estimation.
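The abstract names the ingredients of the two attention blocks (1×1 convolutions in the CA block, parallel 3×3 and 5×5 convolutions in the SA block) but not how they are wired together. The following pure-Python sketch shows one plausible arrangement and should not be read as the MCSA-HRnet implementation. Everything beyond those kernel sizes is an assumption made for illustration: global average pooling before the 1×1 convolutions, the ReLU and sigmoid activations, channel averaging before the spatial convolutions, fusing the two scales by summation, and all function and parameter names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat, w_reduce, w_expand):
    """Reweight the channels of feat (C x H x W nested lists).

    Assumption: each channel is average-pooled to a single value first.
    On that pooled vector a 1x1 convolution is a matrix-vector product.
    """
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feat)]

def conv2d_same(img, kernel):
    """Zero-padded, stride-1 cross-correlation of an H x W map with a k x k kernel (k odd)."""
    h, w, k = len(img), len(img[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * img[yy][xx]
            out[y][x] = acc
    return out

def spatial_attention(feat, k3, k5):
    """Gate every position with a map built from parallel 3x3 and 5x5 convolutions."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    # Assumption: collapse channels by averaging before the spatial convolutions.
    mean_map = [[sum(feat[i][y][x] for i in range(c)) / c for x in range(w)]
                for y in range(h)]
    a3 = conv2d_same(mean_map, k3)  # small receptive field
    a5 = conv2d_same(mean_map, k5)  # larger receptive field
    # Assumption: the two scales are fused by summation before the sigmoid.
    gate = [[sigmoid(a3[y][x] + a5[y][x]) for x in range(w)] for y in range(h)]
    return [[[feat[i][y][x] * gate[y][x] for x in range(w)] for y in range(h)]
            for i in range(c)]

# Toy usage: 4 channels of 6 x 6 features with hand-picked (not learned) weights.
feat = [[[float(ch + y + x) for x in range(6)] for y in range(6)] for ch in range(4)]
w_reduce = [[0.1] * 4 for _ in range(2)]  # 1x1 conv, 4 -> 2 channels
w_expand = [[0.1] * 2 for _ in range(4)]  # 1x1 conv, 2 -> 4 channels
k3 = [[1.0 / 9] * 3 for _ in range(3)]
k5 = [[1.0 / 25] * 5 for _ in range(5)]
out = spatial_attention(channel_attention(feat, w_reduce, w_expand), k3, k5)
print(len(out), len(out[0]), len(out[0][0]))  # 4 6 6
```

Both blocks preserve the shape of the feature map and only rescale it, which is what would let such blocks be inserted into an existing high-resolution branch without changing the surrounding layers. In a trained network the weights and kernels would be learned rather than fixed as they are here.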