Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation

Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, Zhiguo Cao
{"title":"Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation","authors":"Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, Zhiguo Cao","doi":"10.1109/CVPR52729.2023.00706","DOIUrl":null,"url":null,"abstract":"Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match the keypoints across the image for localization. However, such a one-stage matching paradigm shows inferior accuracy: the prediction heavily relies on the matching results, which can be noisy due to the open set nature in CAPE. For example, two mirror-symmetric keypoints (e.g., left and right eyes) in the query image can both trigger high similarity on certain support keypoints (eyes), which leads to duplicated or opposite predictions. To calibrate the inaccurate matching results, we introduce a two-stage framework, where matched keypoints from the first stage are viewed as similarity-aware position proposals. Then, the model learns to fetch relevant features to correct the initial proposals in the second stage. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs to improve the representation and similarity modeling in the first matching stage. In the second stage, similarity-aware proposals are packed as queries in the decoder for refinement via cross-attention. Our method surpasses the previous best approach by large margins on CAPE benchmark MP-100 on both accuracy and efficiency. Code available at github.com/flyinglynx/CapeFormer","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52729.2023.00706","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match support keypoints against the query image for localization. However, such a one-stage matching paradigm yields inferior accuracy: the prediction relies heavily on the matching results, which can be noisy due to the open-set nature of CAPE. For example, two mirror-symmetric keypoints in the query image (e.g., the left and right eyes) can both score high similarity against a given support keypoint (an eye), leading to duplicated or swapped predictions. To calibrate these inaccurate matching results, we introduce a two-stage framework in which the matched keypoints from the first stage are treated as similarity-aware position proposals. In the second stage, the model learns to fetch relevant features and correct the initial proposals. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs that improve representation and similarity modeling in the first matching stage. In the second stage, the similarity-aware proposals are packed as queries into the decoder and refined via cross-attention. Our method surpasses the previous best approach by large margins on the CAPE benchmark MP-100 in both accuracy and efficiency. Code is available at github.com/flyinglynx/CapeFormer.
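
To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is not the authors' CapeFormer implementation: the soft-argmax proposal step, the `ProposalRefiner` module, and all dimensions are illustrative assumptions; only the overall structure (stage-1 matching produces position proposals, stage-2 cross-attention refines them) follows the abstract.

```python
# Minimal sketch of the two-stage CAPE idea; not the CapeFormer code.
# All names and dimensions (soft_argmax, ProposalRefiner, dim=256) are
# illustrative assumptions, not from the paper.
import torch
import torch.nn as nn

def soft_argmax(sim, h, w):
    """Stage 1: turn a keypoint-to-image similarity map into a
    similarity-aware position proposal via a softmax-weighted average
    of pixel coordinates (normalized to [0, 1])."""
    # sim: (B, K, H*W) similarity between K support keypoints and query pixels
    prob = sim.softmax(dim=-1)                          # (B, K, H*W)
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (H*W, 2)
    return prob @ coords                                # (B, K, 2) proposals

class ProposalRefiner(nn.Module):
    """Stage 2: treat proposals as decoder queries and refine them by
    cross-attending to the query-image features, predicting an offset."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.pos_embed = nn.Linear(2, dim)              # embed (x, y) proposal
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.offset_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                         nn.Linear(dim, 2))

    def forward(self, proposals, kp_feats, img_feats):
        # proposals: (B, K, 2), kp_feats: (B, K, dim), img_feats: (B, H*W, dim)
        q = kp_feats + self.pos_embed(proposals)        # similarity-aware queries
        fetched, _ = self.cross_attn(q, img_feats, img_feats)
        return proposals + self.offset_head(fetched)    # corrected keypoints

# Usage with random tensors standing in for encoder outputs:
B, K, H, W, D = 2, 17, 16, 16, 256
kp_feats = torch.randn(B, K, D)                         # support keypoint features
img_feats = torch.randn(B, H * W, D)                    # query image features
sim = kp_feats @ img_feats.transpose(1, 2)              # (B, K, H*W) matching
proposals = soft_argmax(sim, H, W)                      # stage 1: noisy proposals
refined = ProposalRefiner(D)(proposals, kp_feats, img_feats)  # stage 2
print(refined.shape)                                    # torch.Size([2, 17, 2])
```

The point of the second stage is that the model does not trust the similarity map blindly: the proposal only initializes the query position, and cross-attention lets the decoder fetch query-image evidence to correct it, which is what resolves the mirror-symmetry ambiguity described above.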