A State Space Model for Multiobject Full 3-D Information Estimation From RGB-D Images

IF 9.4 · Region 1 (Computer Science) · Q1 AUTOMATION & CONTROL SYSTEMS
Jiaming Zhou;Qing Zhu;Yaonan Wang;Mingtao Feng;Jian Liu;Jianan Huang;Ajmal Mian
{"title":"RGB-D图像多目标全三维信息估计的状态空间模型","authors":"Jiaming Zhou;Qing Zhu;Yaonan Wang;Mingtao Feng;Jian Liu;Jianan Huang;Ajmal Mian","doi":"10.1109/TCYB.2025.3548788","DOIUrl":null,"url":null,"abstract":"Visual understanding of 3-D objects is essential for robotic manipulation, autonomous navigation, and augmented reality. However, existing methods struggle to perform this task efficiently and accurately in an end-to-end manner. We propose a single-shot method based on the state space model (SSM) to predict the full 3-D information (pose, size, shape) of multiple 3-D objects from a single RGB-D image in an end-to-end manner. Our method first encodes long-range semantic information from RGB and depth images separately and then combines them into an integrated latent representation that is processed by a modified SSM to infer the full 3-D information in two separate task heads within a unified model. A heatmap/detection head predicts object centers, and a 3-D information head predicts a matrix detailing the pose, size and latent code of shape for each detected object. We also propose a shape autoencoder based on the SSM, which learns canonical shape codes derived from a large database of 3-D point cloud shapes. The end-to-end framework, modified SSM block and SSM-based shape autoencoder form major contributions of this work. Our design includes different scan strategies tailored to different input data representations, such as RGB-D images and point clouds. Extensive evaluations on the REAL275, CAMERA25, and Wild6D datasets show that our method achieves state-of-the-art performance. On the large-scale Wild6D dataset, our model significantly outperforms the nearest competitor, achieving 2.6% and 5.1% improvements on the IOU-50 and 5°10 cm metrics, respectively.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"55 5","pages":"2248-2260"},"PeriodicalIF":9.4000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A State Space Model for Multiobject Full 3-D Information Estimation From RGB-D Images\",\"authors\":\"Jiaming Zhou;Qing Zhu;Yaonan Wang;Mingtao Feng;Jian Liu;Jianan Huang;Ajmal Mian\",\"doi\":\"10.1109/TCYB.2025.3548788\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual understanding of 3-D objects is essential for robotic manipulation, autonomous navigation, and augmented reality. However, existing methods struggle to perform this task efficiently and accurately in an end-to-end manner. We propose a single-shot method based on the state space model (SSM) to predict the full 3-D information (pose, size, shape) of multiple 3-D objects from a single RGB-D image in an end-to-end manner. Our method first encodes long-range semantic information from RGB and depth images separately and then combines them into an integrated latent representation that is processed by a modified SSM to infer the full 3-D information in two separate task heads within a unified model. A heatmap/detection head predicts object centers, and a 3-D information head predicts a matrix detailing the pose, size and latent code of shape for each detected object. We also propose a shape autoencoder based on the SSM, which learns canonical shape codes derived from a large database of 3-D point cloud shapes. The end-to-end framework, modified SSM block and SSM-based shape autoencoder form major contributions of this work. 
Our design includes different scan strategies tailored to different input data representations, such as RGB-D images and point clouds. Extensive evaluations on the REAL275, CAMERA25, and Wild6D datasets show that our method achieves state-of-the-art performance. On the large-scale Wild6D dataset, our model significantly outperforms the nearest competitor, achieving 2.6% and 5.1% improvements on the IOU-50 and 5°10 cm metrics, respectively.\",\"PeriodicalId\":13112,\"journal\":{\"name\":\"IEEE Transactions on Cybernetics\",\"volume\":\"55 5\",\"pages\":\"2248-2260\"},\"PeriodicalIF\":9.4000,\"publicationDate\":\"2025-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cybernetics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10934139/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10934139/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Visual understanding of 3-D objects is essential for robotic manipulation, autonomous navigation, and augmented reality. However, existing methods struggle to perform this task efficiently and accurately in an end-to-end manner. We propose a single-shot method based on the state space model (SSM) to predict the full 3-D information (pose, size, shape) of multiple 3-D objects from a single RGB-D image in an end-to-end manner. Our method first encodes long-range semantic information from RGB and depth images separately and then combines them into an integrated latent representation that is processed by a modified SSM to infer the full 3-D information in two separate task heads within a unified model. A heatmap/detection head predicts object centers, and a 3-D information head predicts a matrix detailing the pose, size and latent code of shape for each detected object. We also propose a shape autoencoder based on the SSM, which learns canonical shape codes derived from a large database of 3-D point cloud shapes. The end-to-end framework, modified SSM block and SSM-based shape autoencoder form major contributions of this work. Our design includes different scan strategies tailored to different input data representations, such as RGB-D images and point clouds. Extensive evaluations on the REAL275, CAMERA25, and Wild6D datasets show that our method achieves state-of-the-art performance. On the large-scale Wild6D dataset, our model significantly outperforms the nearest competitor, achieving 2.6% and 5.1% improvements on the IOU-50 and 5°10 cm metrics, respectively.
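To make the layout described in the abstract more concrete (fused RGB-D features, a state-space scan over the latent map, then a center-heatmap head and a 3-D information head), the following is a minimal PyTorch sketch. It is an illustrative assumption, not the authors' implementation: the stand-in convolutional encoders, the plain diagonal linear-recurrence scan, the single raster scan order, the 9-channel pose/size parameterization, and all module and parameter names are hypothetical, and the paper's modified SSM block, multiple scan strategies, and SSM-based shape autoencoder are not reproduced here.

```python
import torch
import torch.nn as nn


class DiagonalSSMScan(nn.Module):
    """Linear state-space recurrence h_k = a * h_{k-1} + B u_k, y_k = C h_k, with diagonal decay a."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        # Per-(channel, state) log-decay; exp(.) keeps the recurrence stable in (0, 1].
        self.log_decay = nn.Parameter(-0.5 * torch.rand(dim, state))
        self.B = nn.Linear(dim, dim * state, bias=False)   # input projection u_k -> state update
        self.C = nn.Linear(dim * state, dim, bias=False)   # state readout

    def forward(self, u):                                  # u: (batch, length, dim)
        b, l, d = u.shape
        a = torch.exp(self.log_decay)                      # (dim, state)
        h = u.new_zeros(b, d, a.shape[1])
        ys = []
        for k in range(l):                                 # sequential scan over the token sequence
            h = a * h + self.B(u[:, k]).view(b, d, -1)
            ys.append(self.C(h.flatten(1)))
        return torch.stack(ys, dim=1)                      # (batch, length, dim)


class FullInfoNet(nn.Module):
    """Two-head layout: object-center heatmap + dense pose/size/shape-code map (illustrative only)."""

    def __init__(self, feat_dim: int = 64, shape_code: int = 32):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, feat_dim, 3, stride=4, padding=1)    # stand-in RGB encoder
        self.depth_enc = nn.Conv2d(1, feat_dim, 3, stride=4, padding=1)  # stand-in depth encoder
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)                 # fuse into one latent map
        self.ssm = DiagonalSSMScan(feat_dim)
        self.heatmap_head = nn.Conv2d(feat_dim, 1, 1)                    # object-center scores
        # 3 rotation (axis-angle) + 3 translation + 3 size + latent shape code, per location
        self.info_head = nn.Conv2d(feat_dim, 9 + shape_code, 1)

    def forward(self, rgb, depth):
        f = self.fuse(torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1))
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)        # one raster-order scan of the feature map
        f = self.ssm(tokens).transpose(1, 2).view(b, c, h, w)
        return torch.sigmoid(self.heatmap_head(f)), self.info_head(f)


if __name__ == "__main__":
    net = FullInfoNet()
    heat, info = net(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))
    print(heat.shape, info.shape)   # (1, 1, 32, 32) and (1, 41, 32, 32)
```

With a layout like this, detections would typically be read out CenterNet-style: take local maxima of the heatmap as object centers and read the corresponding column of the 3-D information map at each peak; how the paper actually decodes its heads may differ.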
Source journal
IEEE Transactions on Cybernetics
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, CYBERNETICS
CiteScore: 25.40
Self-citation rate: 11.00%
Articles published: 1869
About the journal: The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the transactions welcomes papers on communication and control across machines or machine, human, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.