RelPose-TTA: Energy-based relative pose correction for test-time adaptation of category-level object pose estimation

IF 4.2; Zone 3, Computer Science; JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2026-04-01 Epub Date: 2026-02-07 DOI:10.1016/j.imavis.2026.105928

Yue Zhan, Xin Wang, Zhaoxiang Liu, Shiguo Lian, Tangwen Yang

Image and Vision Computing, Volume 168, Article 105928. Full text: https://www.sciencedirect.com/science/article/pii/S026288562600034X

Citations: 0

Abstract

Category-level object pose estimation is fundamental to robotic grasping and manipulation, yet models trained on synthetic data often generalize poorly to real-world environments because of substantial domain gaps. Test-time adaptation (TTA) is a promising way to address this challenge, but existing methods frequently depend on noisy pseudo-labels or complex optimization, which can lead to performance degradation and error accumulation over time. In this paper, we propose RelPose-TTA, a test-time adaptation framework that improves the generalization and long-term stability of category-level object pose estimation in previously unseen real-world environments. The core idea is to exploit the relative motion between consecutive frames, which is typically more stable and reliable than single-frame absolute pose estimation, and to use it as a self-supervisory signal during inference. Concretely, RelPose-TTA introduces an energy-based relative pose corrector to model inter-frame motion and mitigate ambiguities induced by occlusions, object symmetries, and large viewpoint changes. During test-time adaptation, the corrector is updated online via contrastive learning and is tightly coupled with point cloud registration, so that refined relative pose estimates can effectively guide absolute pose refinement. Extensive experiments demonstrate that RelPose-TTA consistently outperforms prior TTA methods in unseen real-world settings, while substantially reducing long-term drift and maintaining stable pose predictions.
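The abstract gives no formulas, so the following is only an illustrative sketch of the two ingredients it names: relative motion between consecutive frames, and a contrastive objective over an energy score. It assumes poses are 4x4 homogeneous object-to-camera transforms and uses an InfoNCE-style loss as a stand-in for the unspecified contrastive loss. The function names `relative_pose`, `rotation_angle`, and `contrastive_energy_loss` are hypothetical and do not come from the paper.

```python
import numpy as np


def relative_pose(T_prev, T_curr):
    """Inter-frame motion T_rel such that T_curr = T_rel @ T_prev.

    Assumption: both inputs are 4x4 homogeneous object-to-camera poses.
    """
    return T_curr @ np.linalg.inv(T_prev)


def rotation_angle(R):
    """Geodesic angle in radians of a 3x3 rotation matrix."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def contrastive_energy_loss(energies, positive_index):
    """InfoNCE-style loss over candidate relative poses (assumed form).

    Lower energy means a more plausible candidate, so the logits are
    the negated energies. The loss is the negative log-probability of
    the candidate treated as the positive.
    """
    logits = -np.asarray(energies, dtype=float)
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive_index])
```

In this sketch, a corrector that assigns the positive candidate a lower energy than the negatives produces a smaller loss, which is the behaviour a contrastive update would encourage.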

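Source journal: Image and Vision Computing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months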

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.