Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu

IEEE Transactions on Neural Networks and Learning Systems, published June 25, 2025. DOI: 10.1109/tnnls.2025.3581411
Abstract:
Advances in foundation models have enabled applications across a wide range of downstream tasks. In particular, large language models (LLMs) have recently been extended to tackle 3-D scene understanding. Current methods rely heavily on 3-D point clouds, but reconstructing the 3-D point cloud of an indoor scene often loses information: textureless planes and repetitive patterns are prone to omission and appear as voids in the reconstructed point cloud, and objects with complex structures suffer distorted details caused by misalignment between the captured images and the densely reconstructed points. 2-D multiview images are visually consistent with 3-D point clouds and provide more detailed representations of scene components, so they can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3-D multimodal framework that leverages multiview images for enhanced 3-D scene understanding with LLMs. Argus can be regarded as a 3-D large multimodal foundation model (3D-LMM): it takes text instructions, 2-D multiview images, and 3-D point clouds as input and extends the capability of LLMs to 3-D tasks. Argus fuses multiview images and camera poses into view-as-scene features, which interact with the 3-D features to create comprehensive and detailed 3-D-aware scene embeddings. Our approach compensates for the information lost during 3-D point cloud reconstruction and helps LLMs better understand the 3-D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs on various downstream tasks.
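To make the view-as-scene idea more concrete, below is a minimal PyTorch-style sketch of one way such a fusion could look: per-view image features are tagged with camera-pose embeddings, fused across views by self-attention, and then exchanged with point cloud features through cross-attention. All module names, dimensions, and design choices here are hypothetical illustrations inferred from the abstract, not the paper's actual architecture.

import torch
import torch.nn as nn

class ViewAsSceneFusion(nn.Module):
    """Hypothetical sketch: per-view image features are combined with
    camera-pose embeddings, fused across views, and then exchanged with
    3-D point cloud features via cross-attention. All dimensions and
    module choices are assumptions, not taken from the paper."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.pose_embed = nn.Linear(12, dim)  # flattened 3x4 camera extrinsics (assumed encoding)
        self.view_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, view_feats, cam_poses, point_feats):
        # view_feats:  (B, V, dim)  per-view image embeddings (e.g., from a ViT)
        # cam_poses:   (B, V, 12)   flattened 3x4 extrinsic matrices
        # point_feats: (B, N, dim)  point cloud token embeddings
        views = view_feats + self.pose_embed(cam_poses)  # pose-aware view tokens
        views = self.view_encoder(views)                 # fuse views into scene-level context
        # point tokens query the pose-aware view features
        attended, _ = self.cross_attn(point_feats, views, views)
        return self.out_proj(point_feats + attended)     # 3-D-aware scene embeddings

# Toy usage with random tensors: batch of 2, 4 views, 1024 point tokens.
fusion = ViewAsSceneFusion()
out = fusion(torch.randn(2, 4, 256), torch.randn(2, 4, 12), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 1024, 256])

The cross-attention direction shown (point tokens querying pose-aware view tokens) is one plausible reading of "view-as-scene features interacting with the 3-D features"; the actual model may stack several such layers or use a different interaction scheme.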
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.