{"title":"SF-Pose:基于工业场景金字塔变压器的语义融合六自由度物体姿态估计","authors":"Jikun Wang;Yinlong Liu;Zhi-Xin Yang","doi":"10.1109/TASE.2025.3529511","DOIUrl":null,"url":null,"abstract":"Object six degrees of freedom (6-DoF) pose estimation is the powerful vision algorithm for the robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data with high-cost collection, making it difficult to apply the algorithm. Many studies discuss the use of synthetic data as a complement to real datasets. However, reducing the gap between synthetic and real data is still a challenging problem. Based on the consistency of object geometric characteristics between real data and synthetic data, we argue that multi-input, rather than image-only input, is more suitable for transfer from synthetic to real, because it strengthens the extraction of object geometric feature. Therefore, we propose a semantic-fusion 6-DoF object pose estimation method that effectively capture common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method performs better than the state-of-the-art (SOTA), indicating that the proposed method can effectively extract and fuse different representations. Furthermore, in response to the lack of industrial scene datasets, we also develop a synthetic pose dataset and conduct the human-robot collaboration experiment to verify the robustness of the proposed method. Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained only on synthetic data and accurately estimate pose parameters in real scenes. Combining physically-based renderer and industrial tools, such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. In this case, the trained model can assist the robot vision system to understand object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets from multiple angles, objects, and scenes. Sufficient datasets can enhance the model’s generalization and robustness.","PeriodicalId":51060,"journal":{"name":"IEEE Transactions on Automation Science and Engineering","volume":"22 ","pages":"11767-11779"},"PeriodicalIF":6.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SF-Pose: Semantic-Fusion Six Degrees of Freedom Object Pose Estimation via Pyramid Transformer for Industrial Scenarios\",\"authors\":\"Jikun Wang;Yinlong Liu;Zhi-Xin Yang\",\"doi\":\"10.1109/TASE.2025.3529511\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Object six degrees of freedom (6-DoF) pose estimation is the powerful vision algorithm for the robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data with high-cost collection, making it difficult to apply the algorithm. Many studies discuss the use of synthetic data as a complement to real datasets. However, reducing the gap between synthetic and real data is still a challenging problem. 
Based on the consistency of object geometric characteristics between real data and synthetic data, we argue that multi-input, rather than image-only input, is more suitable for transfer from synthetic to real, because it strengthens the extraction of object geometric feature. Therefore, we propose a semantic-fusion 6-DoF object pose estimation method that effectively capture common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method performs better than the state-of-the-art (SOTA), indicating that the proposed method can effectively extract and fuse different representations. Furthermore, in response to the lack of industrial scene datasets, we also develop a synthetic pose dataset and conduct the human-robot collaboration experiment to verify the robustness of the proposed method. Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained only on synthetic data and accurately estimate pose parameters in real scenes. Combining physically-based renderer and industrial tools, such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. In this case, the trained model can assist the robot vision system to understand object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets from multiple angles, objects, and scenes. Sufficient datasets can enhance the model’s generalization and robustness.\",\"PeriodicalId\":51060,\"journal\":{\"name\":\"IEEE Transactions on Automation Science and Engineering\",\"volume\":\"22 \",\"pages\":\"11767-11779\"},\"PeriodicalIF\":6.4000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Automation Science and Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10841381/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Automation Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10841381/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
SF-Pose: Semantic-Fusion Six Degrees of Freedom Object Pose Estimation via Pyramid Transformer for Industrial Scenarios
Object six-degrees-of-freedom (6-DoF) pose estimation is a powerful vision capability for robot-environment interaction. However, current robust pose estimation algorithms rely heavily on labeled real data, which is costly to collect, making them difficult to deploy. Many studies use synthetic data to complement real datasets, yet narrowing the gap between synthetic and real data remains a challenging problem. Because object geometric characteristics are consistent between real and synthetic data, we argue that multi-modal input, rather than image-only input, is better suited to synthetic-to-real transfer, as it strengthens the extraction of object geometric features. We therefore propose a semantic-fusion 6-DoF object pose estimation method that effectively captures common features across various resolutions by employing the designed pyramid transformer feature-fusion module. Extensive experiments show that the proposed method outperforms the state of the art (SOTA), indicating that it can effectively extract and fuse different representations. Furthermore, to address the lack of industrial-scene datasets, we also develop a synthetic pose dataset and conduct a human-robot collaboration experiment to verify the robustness of the proposed method.

Note to Practitioners—The purpose of this paper is to bridge the gap between synthetic and real data for pose estimation of industrial tools. Our method can be trained on synthetic data alone and still accurately estimate pose parameters in real scenes. By combining a physically based renderer with industrial tools such as hammers and screwdrivers, a synthetic dataset of industrial scenes can be produced using the data production pipeline proposed in this paper. The trained model can then assist a robot vision system in understanding object pose information in a real production workshop. Extensive dataset experiments and human-robot collaboration experiments demonstrate the effectiveness of the proposed method. In addition, based on the actual robot working environment, practitioners can produce industrial datasets covering multiple viewing angles, objects, and scenes; sufficient data enhances the model's generalization and robustness.
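To make the pyramid-transformer fusion idea concrete, the following is a minimal, hypothetical sketch of a multi-scale fusion block: feature maps from an image branch and a geometric branch at several resolutions are flattened into one token sequence and fused with transformer attention. All class, tensor, and parameter names are illustrative assumptions, not the authors' SF-Pose implementation.

```python
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    """Hypothetical pyramid-style fusion: tokens from all scales and both
    modalities attend to each other in a single self-attention pass."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feats, geo_feats):
        # rgb_feats / geo_feats: lists of (B, dim, H_i, W_i) maps, one per pyramid level
        tokens = []
        for r, g in zip(rgb_feats, geo_feats):
            tokens.append(r.flatten(2).transpose(1, 2))  # (B, H_i*W_i, dim)
            tokens.append(g.flatten(2).transpose(1, 2))
        x = torch.cat(tokens, dim=1)       # one sequence spanning all scales/modalities
        fused, _ = self.attn(x, x, x)      # cross-scale, cross-modal attention
        return self.norm(x + fused)        # residual connection + layer norm

# Example: three pyramid levels with a shared channel width of 256
fusion = PyramidFusion(dim=256)
rgb = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
geo = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
out = fusion(rgb, geo)  # shape: (1, 2*(32*32 + 16*16 + 8*8), 256)
```

Treating every scale and modality as tokens in one sequence lets the attention layer learn which resolutions carry the common, transfer-friendly geometric cues; the paper's actual module may differ in structure and detail.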
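On the data side, the core bookkeeping in any such synthetic pipeline is deriving the 6-DoF training label from the renderer's scene graph. The sketch below assumes the renderer exposes object-to-world and camera-to-world transforms as 4x4 homogeneous matrices (function and variable names are illustrative, not the paper's pipeline) and computes the object pose in the camera frame:

```python
import numpy as np

def pose_label(T_world_obj: np.ndarray, T_world_cam: np.ndarray):
    """Return (R, t): object rotation (3x3) and translation (3,) in the
    camera frame, i.e. the 6-DoF ground-truth label for one rendered image."""
    T_cam_obj = np.linalg.inv(T_world_cam) @ T_world_obj  # both 4x4 homogeneous
    return T_cam_obj[:3, :3], T_cam_obj[:3, 3]
```

Physically based rendering toolkits generally export these transforms alongside the rendered RGB and depth images, so labels of this kind come essentially for free with every synthetic frame.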
Journal introduction:
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.