3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Pub Date: 2023-06-01 · DOI: 10.1109/CVPR52729.2023.00464

Z. Tang, Zhaofan Qiu, Y. Hao, Richang Hong, Ting Yao


Citations: 10

Abstract


Recent transformer-based solutions have shown great success in 3D human pose estimation. However, the computational cost of calculating the joint-to-joint affinity matrix grows quadratically with the number of joints. This drawback is even more severe for pose estimation in a video sequence, which requires spatio-temporal correlation spanning the entire video. In this paper, we alleviate the issue by decomposing correlation learning into space and time, and present a novel Spatio-Temporal Criss-cross attention (STC) block. Technically, STC first slices its input feature evenly into two partitions along the channel dimension, then performs spatial attention on one partition and temporal attention on the other. STC then models the interactions between joints in the same frame and joints in the same trajectory simultaneously by concatenating the outputs of the two attention layers. On this basis, we devise STCFormer by stacking multiple STC blocks, and further integrate a new Structure-enhanced Positional Embedding (SPE) into STCFormer to take the structure of the human body into account. The embedding function consists of two components: spatio-temporal convolution around neighboring joints to capture local structure, and part-aware embedding to indicate which body part each joint belongs to. Extensive experiments on the Human3.6M and MPI-INF-3DHP benchmarks show superior results compared with state-of-the-art approaches. More remarkably, STCFormer achieves the best published performance to date: 40.5 mm P1 error on the challenging Human3.6M dataset.
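The channel split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections (a trained model would learn these), omits SPE, normalization, and feed-forward layers, and uses the hypothetical names `attention`, `stc_block`, and `affinity_counts`. The frame count 243 and joint count 17 in the usage lines are illustrative values, not figures taken from the text above.

```python
import math


def attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections, one head).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def stc_block(x):
    """x[t][j] is the C-channel feature of joint j in frame t (C must be even)."""
    T, J, C = len(x), len(x[0]), len(x[0][0])
    half = C // 2
    # Spatial branch: within each frame, joints attend to each other (first half).
    spatial = [attention([x[t][j][:half] for j in range(J)]) for t in range(T)]
    # Temporal branch: within each joint trajectory, frames attend to each other
    # (second half). Indexed as temporal[j][t].
    temporal = [attention([x[t][j][half:] for t in range(T)]) for j in range(J)]
    # Concatenate the two branches back along the channel dimension.
    return [[spatial[t][j] + temporal[j][t] for j in range(J)] for t in range(T)]


def affinity_counts(T, J):
    """Pairwise affinities: full spatio-temporal attention vs. the decomposition."""
    full = (T * J) ** 2                 # every token attends to every token
    decomposed = T * J * J + J * T * T  # per-frame spatial + per-joint temporal
    return full, decomposed


x = [[[0.1 * (t + j + c) for c in range(4)] for j in range(3)] for t in range(2)]
y = stc_block(x)
print(len(y), len(y[0]), len(y[0][0]))  # 2 3 4
print(affinity_counts(243, 17))         # (17065161, 1074060)
```

Under these assumptions, the output keeps the input shape, and the number of affinities drops from (T·J)² to T·J² + J·T², which is where the saving over full spatio-temporal attention comes from.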

