{"title":"SF-Pose:基于工业场景金字塔变压器的语义融合六自由度物体姿态估计","authors":"Jikun Wang;Yinlong Liu;Zhi-Xin Yang","doi":"10.1109/TASE.2025.3529511","DOIUrl":null,"url":null,"abstract":"Object six degrees of freedom (6-DoF) pose estimation is the powerful vision algorithm for the robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data with high-cost collection, making it difficult to apply the algorithm. Many studies discuss the use of synthetic data as a complement to real datasets. However, reducing the gap between synthetic and real data is still a challenging problem. Based on the consistency of object geometric characteristics between real data and synthetic data, we argue that multi-input, rather than image-only input, is more suitable for transfer from synthetic to real, because it strengthens the extraction of object geometric feature. Therefore, we propose a semantic-fusion 6-DoF object pose estimation method that effectively capture common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method performs better than the state-of-the-art (SOTA), indicating that the proposed method can effectively extract and fuse different representations. Furthermore, in response to the lack of industrial scene datasets, we also develop a synthetic pose dataset and conduct the human-robot collaboration experiment to verify the robustness of the proposed method. Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained only on synthetic data and accurately estimate pose parameters in real scenes. Combining physically-based renderer and industrial tools, such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. In this case, the trained model can assist the robot vision system to understand object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets from multiple angles, objects, and scenes. Sufficient datasets can enhance the model’s generalization and robustness.","PeriodicalId":51060,"journal":{"name":"IEEE Transactions on Automation Science and Engineering","volume":"22 ","pages":"11767-11779"},"PeriodicalIF":6.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SF-Pose: Semantic-Fusion Six Degrees of Freedom Object Pose Estimation via Pyramid Transformer for Industrial Scenarios\",\"authors\":\"Jikun Wang;Yinlong Liu;Zhi-Xin Yang\",\"doi\":\"10.1109/TASE.2025.3529511\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Object six degrees of freedom (6-DoF) pose estimation is the powerful vision algorithm for the robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data with high-cost collection, making it difficult to apply the algorithm. Many studies discuss the use of synthetic data as a complement to real datasets. However, reducing the gap between synthetic and real data is still a challenging problem. 
Based on the consistency of object geometric characteristics between real data and synthetic data, we argue that multi-input, rather than image-only input, is more suitable for transfer from synthetic to real, because it strengthens the extraction of object geometric feature. Therefore, we propose a semantic-fusion 6-DoF object pose estimation method that effectively capture common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method performs better than the state-of-the-art (SOTA), indicating that the proposed method can effectively extract and fuse different representations. Furthermore, in response to the lack of industrial scene datasets, we also develop a synthetic pose dataset and conduct the human-robot collaboration experiment to verify the robustness of the proposed method. Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained only on synthetic data and accurately estimate pose parameters in real scenes. Combining physically-based renderer and industrial tools, such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. In this case, the trained model can assist the robot vision system to understand object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets from multiple angles, objects, and scenes. Sufficient datasets can enhance the model’s generalization and robustness.\",\"PeriodicalId\":51060,\"journal\":{\"name\":\"IEEE Transactions on Automation Science and Engineering\",\"volume\":\"22 \",\"pages\":\"11767-11779\"},\"PeriodicalIF\":6.4000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Automation Science and Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10841381/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Automation Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10841381/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
SF-Pose: Semantic-Fusion Six Degrees of Freedom Object Pose Estimation via Pyramid Transformer for Industrial Scenarios
Object six-degrees-of-freedom (6-DoF) pose estimation is a powerful vision capability for robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data, which is costly to collect, making them difficult to deploy. Many studies use synthetic data to complement real datasets, yet narrowing the gap between synthetic and real data remains a challenging problem. Because object geometric characteristics are consistent between real and synthetic data, we argue that multi-modal input, rather than image-only input, is better suited to synthetic-to-real transfer, as it strengthens the extraction of object geometric features. We therefore propose a semantic-fusion 6-DoF object pose estimation method that effectively captures common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method outperforms the state of the art (SOTA), indicating that it can effectively extract and fuse different representations. Furthermore, to address the lack of industrial-scene datasets, we also develop a synthetic pose dataset and conduct a human-robot collaboration experiment to verify the robustness of the proposed method.

Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained on synthetic data alone and still accurately estimate pose parameters in real scenes. By combining a physically based renderer with industrial tools such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. The trained model can then assist a robot vision system in understanding object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets covering multiple viewing angles, objects, and scenes; sufficient data enhances the model's generalization and robustness.
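To make the pyramid-transformer fusion idea concrete, the following is a minimal, hypothetical sketch of a multi-scale fusion block: feature maps from an image branch and a geometric branch at several resolutions are flattened into one token sequence and fused with transformer attention. All class, tensor, and parameter names are illustrative assumptions, not the authors' SF-Pose implementation.

```python
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    """Hypothetical pyramid-style fusion: tokens from all scales and both
    modalities attend to each other in a single self-attention pass."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, geo_feats):
        # rgb_feats / geo_feats: lists of (B, dim, H_i, W_i) maps, one per pyramid level
        tokens = []
        for r, g in zip(rgb_feats, geo_feats):
            tokens.append(r.flatten(2).transpose(1, 2))  # (B, H_i*W_i, dim)
            tokens.append(g.flatten(2).transpose(1, 2))
        x = torch.cat(tokens, dim=1)       # one sequence spanning all scales/modalities
        fused, _ = self.attn(x, x, x)      # cross-scale, cross-modal attention
        return self.norm(x + fused)        # residual connection + layer norm

# Example: three pyramid levels with a shared channel width of 256
fusion = PyramidFusion(dim=256)
rgb = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
geo = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
out = fusion(rgb, geo)  # shape: (1, 2*(32*32 + 16*16 + 8*8), 256)
```

Treating every scale and modality as tokens in one sequence lets the attention layer learn which resolutions carry the common, transfer-friendly geometric cues; the paper's actual module may differ in structure and detail.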
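On the data side, the core bookkeeping in any such synthetic pipeline is deriving the 6-DoF training label from the renderer's scene graph. The sketch below assumes the renderer exposes object-to-world and camera-to-world transforms as 4x4 homogeneous matrices (function and variable names are illustrative, not the paper's pipeline) and computes the object pose in the camera frame:

```python
import numpy as np

def pose_label(T_world_obj: np.ndarray, T_world_cam: np.ndarray):
    """Return (R, t): object rotation (3x3) and translation (3,) in the
    camera frame, i.e. the 6-DoF ground-truth label for one rendered image."""
    T_cam_obj = np.linalg.inv(T_world_cam) @ T_world_obj  # both 4x4 homogeneous
    return T_cam_obj[:3, :3], T_cam_obj[:3, 3]
```

Physically based rendering toolkits generally export these transforms alongside the rendered RGB and depth images, so labels of this kind come essentially for free with every synthetic frame.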
Journal introduction:
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.