MMFEIR: Multi-attention Mutual Feature Enhance and Instance Reconstruction for category-level 6D object pose estimation

IF 4.2 | CAS Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing | Pub Date: 2025-07-16 | DOI: 10.1016/j.imavis.2025.105657

Haotian Lei, Xiangyu Liu, Yan Zhou, Guo Niu, Changan Yi, Yuexia Zhou, Xiaofeng Liang, Fuhe Liu

Image and Vision Computing, Volume 162, Article 105657. Full text: https://www.sciencedirect.com/science/article/pii/S0262885625002458

Citations: 0

MMFEIR: Multi-attention Mutual Feature Enhance and Instance Reconstruction for category-level 6D object pose estimation

Category-level 6D object pose estimation is a fundamental problem in fields such as robotic manipulation and augmented reality. The goal of the task is to predict the rotation, translation, and size of an object. Current methods typically estimate the 6D pose by extracting a deformation field from the observed point cloud of the object. However, they do not fully consider the interaction between the observed point cloud, the prior shape, and the image of the object. As a result, geometric and texture features of the object are lost, which reduces pose estimation accuracy for objects with large intra-class configuration differences. In this paper, we propose a Multi-attention Mutual Feature Enhance Module (MMFEM) to strengthen the inherent linkages among the different perception data of an object. MMFEM enhances the interaction between the image, the observed point cloud, and the prior shape through multiple attention modules, which enables the network to gain a deeper understanding of the differences between distinct instances. In addition, to improve the feature expression of geometric details, we propose the Instance Reconstruction Deformation Module (IRDM). IRDM reconstructs the three-dimensional instance point cloud of each object, enhancing the model's ability to identify differences in the geometric configurations of objects. Extensive experiments on the CAMERA25 and REAL275 datasets show that the proposed method achieves 79.0% and 91.2% on the 3D75 metric and 52.6% and 75.9% on the 5°2 cm metric, respectively, outperforming current mainstream methods.
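For readers unfamiliar with the 5°2 cm metric quoted above, the following is a minimal sketch of the usual per-object check: a predicted pose counts as correct when its rotation error is below 5 degrees and its translation error is below 2 cm. The function names, the metre units for translations, and the plain nested-list matrix layout are assumptions made for illustration and are not taken from the paper. The sketch also ignores the special handling of symmetric objects that benchmark evaluation scripts typically apply.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_pred^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres; inputs are assumed to be in metres."""
    return math.dist(t_pred, t_gt) * 100.0


def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg=5.0, max_cm=2.0):
    """True if the pose passes both the rotation and translation thresholds."""
    return (rotation_error_deg(R_pred, R_gt) < max_deg
            and translation_error_cm(t_pred, t_gt) < max_cm)


def rot_z(deg):
    """Helper for the demo: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    # 3 degrees of rotation error and 1 cm of translation error: passes.
    print(within_threshold(rot_z(3.0), [0.0, 0.0, 0.01], IDENTITY, [0.0, 0.0, 0.0]))
    # 6 degrees of rotation error: fails regardless of translation.
    print(within_threshold(rot_z(6.0), [0.0, 0.0, 0.0], IDENTITY, [0.0, 0.0, 0.0]))
```

The reported percentage would then be the fraction of evaluated objects for which this check passes, under whatever matching and symmetry conventions the benchmark uses.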


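Source journal: Image and Vision Computing (Engineering & Technology; Engineering: Electrical & Electronic)
CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months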

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.