基于关键点交互转换器的结构支持依赖性学习，用于一般哺乳动物姿态估计

IF 9.3 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2025-02-07 DOI:10.1007/s11263-025-02355-0

Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu

{"title":"基于关键点交互转换器的结构支持依赖性学习，用于一般哺乳动物姿态估计","authors":"Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu","doi":"10.1007/s11263-025-02355-0","DOIUrl":null,"url":null,"abstract":"General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints. Extensive experimental results obtained on the widely used animal datasets, AP10K and AnimalKingdom, demonstrate the superiority of the proposed method over the state-of-the-art approaches. It achieves 77.9 AP on the AP10K val set, outperforming HRFormer by 2.2. Besides, our KITPose can be directly transferred to human pose estimation with promising results, as evaluated on COCO, reflecting the merits of constructing structure-supporting architectures for general mammal pose estimation.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":9.3000,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation\",\"authors\":\"Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu\",\"doi\":\"10.1007/s11263-025-02355-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints. Extensive experimental results obtained on the widely used animal datasets, AP10K and AnimalKingdom, demonstrate the superiority of the proposed method over the state-of-the-art approaches. It achieves 77.9 AP on the AP10K val set, outperforming HRFormer by 2.2. Besides, our KITPose can be directly transferred to human pose estimation with promising results, as evaluated on COCO, reflecting the merits of constructing structure-supporting architectures for general mammal pose estimation.\",\"PeriodicalId\":13752,\"journal\":{\"name\":\"International Journal of Computer Vision\",\"volume\":\"14 1\",\"pages\":\"\"},\"PeriodicalIF\":9.3000,\"publicationDate\":\"2025-02-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Computer Vision\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11263-025-02355-0\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-025-02355-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

一般哺乳动物的姿态估计是计算机视觉中的一项重要且具有挑战性的任务，对于理解现实世界中哺乳动物的行为至关重要。然而，现有的研究还处于初步研究阶段，主要集中在解决少数特定哺乳动物物种的问题。原则上，从特定到一般的哺乳动物姿势估计，最大的问题是如何解决不同物种的巨大外观和姿势差异。我们认为，给定的外观上下文，实例级先验和关键点之间的结构关系可以作为补充证据。为此，我们提出了一个关键点交互转换器（KIT）来学习用于一般哺乳动物姿态估计的实例级结构支持依赖关系。具体来说，我们的KITPose由两个耦合组件组成。第一个组件是提取关键点特征并生成身体部位提示。这些特征由专用的广义热图回归损失（GHRL）进行监督。我们没有引入外部视觉/文本提示，而是设计关键点聚类来生成身体部位偏差，并将它们与图像上下文对齐以生成相应的实例级提示。其次，我们提出了一种新的交互式转换器，它将特征片作为输入令牌而不执行空间分割。此外，为了增强KIT模型的能力，我们设计了一种自适应权重策略来解决不同关键点之间的不平衡问题。在广泛使用的动物数据集AP10K和AnimalKingdom上获得的大量实验结果表明，所提出的方法优于最先进的方法。它在AP10K值设置上达到77.9 AP，比HRFormer高出2.2。此外，我们的KITPose可以直接转移到人类姿态估计中，并得到了很好的结果，这反映了构建结构支持架构用于一般哺乳动物姿态估计的优点。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation

General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints. Extensive experimental results obtained on the widely used animal datasets, AP10K and AnimalKingdom, demonstrate the superiority of the proposed method over the state-of-the-art approaches. It achieves 77.9 AP on the AP10K val set, outperforming HRFormer by 2.2. Besides, our KITPose can be directly transferred to human pose estimation with promising results, as evaluated on COCO, reflecting the merits of constructing structure-supporting architectures for general mammal pose estimation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.