IRPE: Instance-level reconstruction-based 6D pose estimator

IF 4.2; Zone 3, Computer Science; JCR Q2, Computer Science, Artificial Intelligence

Image and Vision Computing Pub Date : 2025-02-01 DOI:10.1016/j.imavis.2024.105340

Le Jin, Guoshun Zhou, Zherong Liu, Yuanchao Yu, Teng Zhang, Minghui Yang, Jun Zhou

Volume 154, Article 105340. Full text: https://www.sciencedirect.com/science/article/pii/S0262885624004451

Citations: 0

Abstract

Estimating an object’s 6D pose is a fundamental task in modern commercial and industrial applications. Vision-based pose estimation has gained popularity because it is cost-effective and easy to set up in the field. However, it tends to be less robust than other methods because it is sensitive to the operating environment. In robot manipulation, for instance, heavy occlusion and clutter are common and pose significant challenges. For safety and robustness in industrial environments, depth information is often used in addition to RGB images. Even with depth information, 6D pose estimation in such scenarios remains challenging. In this paper, we introduce a novel 6D pose estimation method that promotes the network’s learning of high-level object features through self-supervised learning and instance reconstruction. The feature representation of the reconstructed instance is then used for direct 6D pose regression via a multi-task learning scheme. As a result, the proposed method can differentiate and retrieve each object instance from a heavily occluded and cluttered scene, thereby surpassing conventional pose estimators in such scenarios. Additionally, because the reconstructed image is predicted in a standardized form, our estimator is robust to variations in lighting conditions and to color drift. This is a significant improvement over traditional methods that depend on pixel-level sparse or dense features. We demonstrate that our method achieves state-of-the-art performance (e.g., 85.4% on LM-O) on the most commonly used benchmarks with respect to the ADD(-S) metric. Lastly, we present a CLIP dataset that emulates the intense occlusion scenarios of industrial environments, and we conduct a real-world manipulation experiment to verify the effectiveness and robustness of the proposed method.
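For readers unfamiliar with the ADD(-S) metric cited above, the following is a minimal sketch of how it is commonly computed. It is not taken from the paper: the function names, the pure-Python data layout (rotations as nested lists, points as tuples), and the 10%-of-diameter acceptance threshold are assumptions based on the usual benchmark convention.

```python
import math

def transform(points, R, t):
    """Apply the rigid transform x -> R x + t to a list of 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the
    ground-truth pose and the predicted pose."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest
    predicted point; used for symmetric objects."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)

def is_correct(error, diameter, threshold=0.1):
    """Assumed convention: a pose counts as correct when the error is
    below 10% of the object diameter."""
    return error < threshold * diameter

# Toy model that is symmetric under a 180-degree rotation about z.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flip_z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
zero = (0.0, 0.0, 0.0)

add_err = add_metric(model, identity, zero, flip_z, zero)
add_s_err = add_s_metric(model, identity, zero, flip_z, zero)
print(add_err, add_s_err)
```

In this toy case the predicted pose differs from the ground truth by a rotation that maps the model onto itself, so ADD penalizes it heavily while ADD-S reports zero error. This is why benchmarks typically report ADD for asymmetric objects and ADD-S for symmetric ones, written together as ADD(-S).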

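Source journal: Image and Vision Computing (Engineering Technology; Engineering: Electrical and Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months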

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.