Large vision-language models enabled novel objects 6D pose estimation for human-robot collaboration

IF 11.4 | Zone 1, Computer Science | Q1, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Robotics and Computer-integrated Manufacturing Pub Date : 2025-04-26 DOI:10.1016/j.rcim.2025.103030

Wanqing Xia, Hao Zheng, Weiliang Xu, Xun Xu

Volume 95, Article 103030. Full text: https://www.sciencedirect.com/science/article/pii/S0736584525000845

Citations: 0

Abstract

Six-Degree-of-Freedom (6D) pose estimation is essential for robotic manipulation tasks, especially in human-robot collaboration environments. Recently, 6D pose estimation has been extended from seen objects to novel objects, because robots frequently encounter unfamiliar items in real-world scenarios. This paper presents a three-stage pipeline for 6D pose estimation of previously unseen objects, leveraging the capabilities of large vision-language models. Our approach consists of vision-language model-based object detection and segmentation, mask selection with pose hypotheses generated from CAD models, and refinement and scoring of pose candidates. We evaluate our method on the YCB-Video dataset, achieving a state-of-the-art Average Recall (AR) score of 75.8 with RGB-D images, which demonstrates its effectiveness in accurately estimating 6D poses for a diverse range of objects. An ablation study investigates the contribution of each stage. To validate the practical applicability of our approach, we conduct case studies on a real-world robotic platform, focusing on object pick-up tasks and integrating our 6D pose estimation pipeline with human intention prediction and task analysis algorithms. Results show that the proposed method can effectively handle novel objects in our test environments, as demonstrated through the YCB-Video dataset evaluation and the case studies. Our work contributes to the field of human-robot collaboration by introducing a flexible, generalizable approach to 6D pose estimation that enables robots to adapt to new objects without extensive retraining, a vital capability for advancing human-robot collaboration in dynamic environments. More information can be found on the project GitHub page: https://github.com/WanqingXia/HRC_DetAnyPose.
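
To make the three-stage structure described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows a 6D pose as a rotation plus a translation, and a driver that chains detection and segmentation, pose hypothesis generation, and refinement with scoring. All names here (`Pose6D`, `Candidate`, `estimate_pose`, and the three stage callables) are hypothetical, and the sketch simplifies mask selection by scoring every mask's hypotheses and keeping the best one; the actual stages in the paper rely on vision-language and pose models that are only stubbed out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class Pose6D:
    """A rigid transform: 3 rotational + 3 translational degrees of freedom."""
    rotation: Mat3     # 3x3 rotation matrix, row-major
    translation: Vec3  # object origin expressed in the camera frame

    def apply(self, point: Vec3) -> Vec3:
        """Map a point from the object (CAD) frame into the camera frame."""
        r, t = self.rotation, self.translation
        return tuple(
            sum(r[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
        )


@dataclass(frozen=True)
class Candidate:
    mask_id: int   # which segmentation mask produced this hypothesis
    pose: Pose6D
    score: float   # higher is better (a convention assumed for this sketch)


def estimate_pose(
    image: object,
    prompt: str,
    detect_and_segment: Callable[[object, str], Sequence[int]],
    propose_poses: Callable[[object, int], Sequence[Pose6D]],
    refine_and_score: Callable[[object, int, Pose6D], Tuple[Pose6D, float]],
) -> Candidate:
    """Chain the three stages and return the best-scoring candidate."""
    candidates: List[Candidate] = []
    # Stage 1: language-prompted detection and segmentation yields mask ids.
    for mask_id in detect_and_segment(image, prompt):
        # Stage 2: pose hypotheses for this mask (from CAD models in the paper).
        for pose in propose_poses(image, mask_id):
            # Stage 3: refine each hypothesis and assign it a score.
            refined, score = refine_and_score(image, mask_id, pose)
            candidates.append(Candidate(mask_id, refined, score))
    if not candidates:
        raise ValueError("no pose candidates were produced")
    return max(candidates, key=lambda c: c.score)
```

Passing the stages in as callables keeps the driver independent of any particular detector, segmenter, or pose refiner, which mirrors the abstract's emphasis on handling new objects without retraining.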

Source journal

Robotics and Computer-integrated Manufacturing (Engineering & Technology - Engineering: Manufacturing)

CiteScore: 24.10
Self-citation rate: 13.50%
Articles published: 160
Review time: 50 days

About the journal: Robotics and Computer-Integrated Manufacturing focuses on sharing research applications that contribute to the development of new or enhanced robotics, manufacturing technologies, and innovative manufacturing strategies that are relevant to industry. Papers that combine theory and experimental validation are preferred, while review papers on current robotics and manufacturing issues are also considered. However, papers on traditional machining processes, modeling and simulation, supply chain management, and resource optimization are generally not within the scope of the journal, as there are more appropriate journals for these topics. Similarly, papers that are overly theoretical or mathematical will be directed to other suitable journals. The journal welcomes original papers in areas such as industrial robotics, human-robot collaboration in manufacturing, cloud-based manufacturing, cyber-physical production systems, big data analytics in manufacturing, smart mechatronics, machine learning, adaptive and sustainable manufacturing, and other fields involving unique manufacturing technologies.