Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu

IEEE Transactions on Neural Networks and Learning Systems, published June 25, 2025. DOI: 10.1109/tnnls.2025.3581411
Abstract:
Advances in foundation models have enabled applications across a wide range of downstream tasks. In particular, large language models (LLMs) have recently been extended to tackle 3-D scene understanding. Current methods rely heavily on 3-D point clouds, but reconstructing the 3-D point cloud of an indoor scene often loses information: textureless planes and repetitive patterns are prone to omission and appear as voids in the reconstructed point cloud, and objects with complex structures suffer distorted details caused by misalignment between the captured images and the densely reconstructed points. 2-D multiview images are visually consistent with 3-D point clouds and provide more detailed representations of scene components, so they can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3-D multimodal framework that leverages multiview images for enhanced 3-D scene understanding with LLMs. Argus can be regarded as a 3-D large multimodal foundation model (3D-LMM): it takes text instructions, 2-D multiview images, and 3-D point clouds as input and extends the capability of LLMs to 3-D tasks. Argus fuses multiview images and camera poses into view-as-scene features, which interact with the 3-D features to create comprehensive and detailed 3-D-aware scene embeddings. Our approach compensates for the information lost during 3-D point cloud reconstruction and helps LLMs better understand the 3-D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs on various downstream tasks.
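To make the view-as-scene idea more concrete, below is a minimal PyTorch-style sketch of one way such a fusion could look: per-view image features are tagged with camera-pose embeddings, fused across views by self-attention, and then exchanged with point cloud features through cross-attention. All module names, dimensions, and design choices here are hypothetical illustrations inferred from the abstract, not the paper's actual architecture.

import torch
import torch.nn as nn

class ViewAsSceneFusion(nn.Module):
    """Hypothetical sketch: per-view image features are combined with
    camera-pose embeddings, fused across views, and then exchanged with
    3-D point cloud features via cross-attention. All dimensions and
    module choices are assumptions, not taken from the paper."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.pose_embed = nn.Linear(12, dim)  # flattened 3x4 camera extrinsics (assumed encoding)
        self.view_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, view_feats, cam_poses, point_feats):
        # view_feats:  (B, V, dim)  per-view image embeddings (e.g., from a ViT)
        # cam_poses:   (B, V, 12)   flattened 3x4 extrinsic matrices
        # point_feats: (B, N, dim)  point cloud token embeddings
        views = view_feats + self.pose_embed(cam_poses)  # pose-aware view tokens
        views = self.view_encoder(views)                 # fuse views into scene-level context
        # point tokens query the pose-aware view features
        attended, _ = self.cross_attn(point_feats, views, views)
        return self.out_proj(point_feats + attended)     # 3-D-aware scene embeddings

# Toy usage with random tensors: batch of 2, 4 views, 1024 point tokens.
fusion = ViewAsSceneFusion()
out = fusion(torch.randn(2, 4, 256), torch.randn(2, 4, 12), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 1024, 256])

The cross-attention direction shown (point tokens querying pose-aware view tokens) is one plausible reading of "view-as-scene features interacting with the 3-D features"; the actual model may stack several such layers or use a different interaction scheme.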
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.