RelPose-TTA: Energy-based relative pose correction for test-time adaptation of category-level object pose estimation
Yue Zhan, Xin Wang, Zhaoxiang Liu, Shiguo Lian, Tangwen Yang
Image and Vision Computing, volume 168, Article 105928. Published 2026-04-01 (Epub 2026-02-07). Journal Article.
DOI: 10.1016/j.imavis.2026.105928
URL: https://www.sciencedirect.com/science/article/pii/S026288562600034X
Journal impact factor: 4.2; JCR: Q2 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Category-level object pose estimation is fundamental for robotic grasping and manipulation, yet models trained on synthetic data often generalize poorly to real-world environments due to substantial domain gaps. Test-time adaptation (TTA) offers a promising way to address this challenge, but existing methods frequently depend on noisy pseudo-labels or complex optimization, which can lead to performance degradation and error accumulation over time. In this paper, we propose RelPose-TTA, a test-time adaptation framework that improves the generalization and long-term stability of category-level object pose estimation in previously unseen real-world environments. The core idea is to exploit the relative motion between consecutive frames, which is typically more stable and reliable than single-frame absolute pose estimation, and to use it as a self-supervisory signal during inference. Concretely, RelPose-TTA introduces an energy-based relative pose corrector to model inter-frame motion and mitigate ambiguities induced by occlusions, object symmetries, and large viewpoint changes. During test-time adaptation, the corrector is updated online via contrastive learning and is tightly coupled with point cloud registration, so that refined relative pose estimates can effectively guide absolute pose refinement. Extensive experiments demonstrate that RelPose-TTA consistently outperforms prior TTA methods in unseen real-world settings, while substantially reducing long-term drift and maintaining stable pose predictions.
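To make the "relative motion between consecutive frames" idea concrete, here is a minimal sketch of how a relative pose relates two absolute object poses, and how a relative pose estimate could be checked against a single-frame prediction. It is not the authors' implementation: the abstract does not specify the energy-based corrector, the contrastive update, or the registration step, so none of them are modeled here. All function names (`make_pose`, `relative_pose`, `propagate`, `consistency_residual`) and the convention that a pose is a 4x4 object-to-camera transform are assumptions made for illustration.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_pose(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def relative_pose(T_prev, T_curr):
    """Inter-frame motion T_rel such that T_curr = T_rel @ T_prev.

    Assumes both inputs are object-to-camera transforms (hypothetical convention).
    """
    return T_curr @ np.linalg.inv(T_prev)


def propagate(T_prev, T_rel):
    """Carry the previous absolute pose forward using a relative pose."""
    return T_rel @ T_prev


def consistency_residual(T_prev, T_curr_pred, T_rel_est):
    """Disagreement between a single-frame prediction and the propagated pose.

    Returns (rotation error in degrees, translation error in the pose's units).
    A small residual means the absolute prediction agrees with the relative motion.
    """
    T_prop = propagate(T_prev, T_rel_est)
    R_diff = T_prop[:3, :3].T @ T_curr_pred[:3, :3]
    cos_angle = np.clip((np.trace(R_diff) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_angle))
    trans_err = np.linalg.norm(T_prop[:3, 3] - T_curr_pred[:3, 3])
    return rot_err, trans_err


# Toy example with made-up poses for two consecutive frames.
T_prev = make_pose(rot_z(0.1), [0.0, 0.0, 1.0])
T_curr = make_pose(rot_z(0.3), [0.1, 0.0, 1.0])
T_rel = relative_pose(T_prev, T_curr)

# A noisy single-frame prediction for the current frame (illustrative only).
T_noisy = make_pose(rot_z(0.35), [0.12, 0.0, 1.0])
rot_err, trans_err = consistency_residual(T_prev, T_noisy, T_rel)
print(f"rotation residual: {rot_err:.2f} deg, translation residual: {trans_err:.3f}")
```

In a test-time adaptation setting, a residual of this kind is one plausible way to turn inter-frame motion into a self-supervisory signal; how RelPose-TTA actually scores and refines relative poses is described in the paper itself, not in this sketch.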
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.