Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation

Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, Zhiguo Cao
{"title":"Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation","authors":"Min Shi, Zihao Huang, Xianzheng Ma, Xiaowei Hu, Zhiguo Cao","doi":"10.1109/CVPR52729.2023.00706","DOIUrl":null,"url":null,"abstract":"Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match the keypoints across the image for localization. However, such a one-stage matching paradigm shows inferior accuracy: the prediction heavily relies on the matching results, which can be noisy due to the open set nature in CAPE. For example, two mirror-symmetric keypoints (e.g., left and right eyes) in the query image can both trigger high similarity on certain support keypoints (eyes), which leads to duplicated or opposite predictions. To calibrate the inaccurate matching results, we introduce a two-stage framework, where matched keypoints from the first stage are viewed as similarity-aware position proposals. Then, the model learns to fetch relevant features to correct the initial proposals in the second stage. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs to improve the representation and similarity modeling in the first matching stage. In the second stage, similarity-aware proposals are packed as queries in the decoder for refinement via cross-attention. Our method surpasses the previous best approach by large margins on CAPE benchmark MP-100 on both accuracy and efficiency. Code available at github.com/flyinglynx/CapeFormer","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR52729.2023.00706","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match support keypoints against the query image for localization. However, such a one-stage matching paradigm yields inferior accuracy: the prediction relies heavily on the matching results, which can be noisy due to the open-set nature of CAPE. For example, two mirror-symmetric keypoints in the query image (e.g., the left and right eyes) can both score high similarity against a given support keypoint (an eye), leading to duplicated or swapped predictions. To calibrate these inaccurate matching results, we introduce a two-stage framework in which the matched keypoints from the first stage are treated as similarity-aware position proposals. In the second stage, the model learns to fetch relevant features and correct the initial proposals. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs that improve representation and similarity modeling in the first matching stage. In the second stage, the similarity-aware proposals are packed as queries into the decoder and refined via cross-attention. Our method surpasses the previous best approach by large margins on the CAPE benchmark MP-100 in both accuracy and efficiency. Code is available at github.com/flyinglynx/CapeFormer.
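
To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is not the authors' CapeFormer implementation: the soft-argmax proposal step, the `ProposalRefiner` module, and all dimensions are illustrative assumptions; only the overall structure (stage-1 matching produces position proposals, stage-2 cross-attention refines them) follows the abstract.

```python
# Minimal sketch of the two-stage CAPE idea; not the CapeFormer code.
# All names and dimensions (soft_argmax, ProposalRefiner, dim=256) are
# illustrative assumptions, not from the paper.
import torch
import torch.nn as nn

def soft_argmax(sim, h, w):
    """Stage 1: turn a keypoint-to-image similarity map into a
    similarity-aware position proposal via a softmax-weighted average
    of pixel coordinates (normalized to [0, 1])."""
    # sim: (B, K, H*W) similarity between K support keypoints and query pixels
    prob = sim.softmax(dim=-1)                          # (B, K, H*W)
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)  # (H*W, 2)
    return prob @ coords                                # (B, K, 2) proposals

class ProposalRefiner(nn.Module):
    """Stage 2: treat proposals as decoder queries and refine them by
    cross-attending to the query-image features, predicting an offset."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.pos_embed = nn.Linear(2, dim)              # embed (x, y) proposal
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.offset_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                         nn.Linear(dim, 2))

    def forward(self, proposals, kp_feats, img_feats):
        # proposals: (B, K, 2), kp_feats: (B, K, dim), img_feats: (B, H*W, dim)
        q = kp_feats + self.pos_embed(proposals)        # similarity-aware queries
        fetched, _ = self.cross_attn(q, img_feats, img_feats)
        return proposals + self.offset_head(fetched)    # corrected keypoints

# Usage with random tensors standing in for encoder outputs:
B, K, H, W, D = 2, 17, 16, 16, 256
kp_feats = torch.randn(B, K, D)                         # support keypoint features
img_feats = torch.randn(B, H * W, D)                    # query image features
sim = kp_feats @ img_feats.transpose(1, 2)              # (B, K, H*W) matching
proposals = soft_argmax(sim, H, W)                      # stage 1: noisy proposals
refined = ProposalRefiner(D)(proposals, kp_feats, img_feats)  # stage 2
print(refined.shape)                                    # torch.Size([2, 17, 2])
```

The point of the second stage is that the model does not trust the similarity map blindly: the proposal only initializes the query position, and cross-attention lets the decoder fetch query-image evidence to correct it, which is what resolves the mirror-symmetry ambiguity described above.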