Fu-Quan Zhang, Kai-Hong Chen, Tsu-Yang Wu, Yang Hong, Jia-Jun Zhu, Chao Chen, Lin-Juan Ma, Jia-Xin Xu
{"title":"MOGAR:用于高保真3D资产生成的多视图光学和几何自适应细化","authors":"Fu-Quan Zhang , Kai-Hong Chen , Tsu-Yang Wu , Yang Hong , Jia-Jun Zhu , Chao Chen , Lin-Juan Ma , Jia-Xin Xu","doi":"10.1016/j.inffus.2025.103763","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations–CNNs struggle with capturing long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. Fortunately, recent advancements in state space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103763"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MOGAR: Multi-view optical and geometry adaptive refinement for high-fidelity 3D asset generation\",\"authors\":\"Fu-Quan Zhang , Kai-Hong Chen , Tsu-Yang Wu , Yang Hong , Jia-Jun Zhu , Chao Chen , Lin-Juan Ma , Jia-Xin Xu\",\"doi\":\"10.1016/j.inffus.2025.103763\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations–CNNs struggle with capturing long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. 
Fortunately, recent advancements in state space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103763\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008255\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008255","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MOGAR: Multi-view optical and geometry adaptive refinement for high-fidelity 3D asset generation
Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations: CNNs struggle to capture long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. Fortunately, recent advances in state-space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.
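To make the core idea concrete, the sketch below illustrates, in simplified form, why a single causal scan is limiting for multi-view tokens and how scanning the same token sequence in several orders (forward, backward, and across views) can give every token access to global context. This is a minimal, hypothetical illustration only: the module names (SimpleSelectiveScan, CrossDirectionalScan), the gated linear recurrence standing in for Mamba's selective state-space scan, and the fusion scheme are all assumptions for exposition, not the authors' MvGSM implementation.

```python
# Hypothetical sketch of cross-directional scanning over multi-view feature tokens.
# Not the MOGAR/MvGSM code; a simplified gated recurrence replaces the real
# Mamba selective scan to keep the example self-contained and runnable.
import torch
import torch.nn as nn


class SimpleSelectiveScan(nn.Module):
    """Causal scan with input-dependent gating (a stand-in for a selective scan)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # per-token, per-channel decay (the "selective" part)
        self.inp = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.gate(x))                   # decay in (0, 1)
        u = self.inp(x)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):                        # strictly causal: h sees only the past
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)


class CrossDirectionalScan(nn.Module):
    """Fuse forward, backward, and cross-view scan orders so that each token
    receives context from all directions, unlike a single causal pass."""

    def __init__(self, dim: int):
        super().__init__()
        self.scan = SimpleSelectiveScan(dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (batch, views, tokens_per_view, dim)
        b, v, n, d = tokens.shape
        seq = tokens.reshape(b, v * n, d)                      # view-major ordering
        fwd = self.scan(seq)
        bwd = self.scan(seq.flip(1)).flip(1)                   # reversed ordering
        # Token-major ordering: visit the same spatial token across all views first.
        cross = self.scan(tokens.transpose(1, 2).reshape(b, n * v, d))
        cross = cross.reshape(b, n, v, d).transpose(1, 2).reshape(b, v * n, d)
        out = self.fuse(torch.cat([fwd, bwd, cross], dim=-1))
        return out.reshape(b, v, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 4, 16, 32)          # 2 samples, 4 views, 16 tokens per view, 32 channels
    y = CrossDirectionalScan(32)(x)
    print(y.shape)                          # torch.Size([2, 4, 16, 32])
```

The design point the sketch tries to capture is that the additional scan orders, not the recurrence itself, are what remove the unidirectional bias of the original Mamba while keeping the per-token cost linear in sequence length.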
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.