Fu-Quan Zhang, Kai-Hong Chen, Tsu-Yang Wu, Yang Hong, Jia-Jun Zhu, Chao Chen, Lin-Juan Ma, Jia-Xin Xu
{"title":"MOGAR:用于高保真3D资产生成的多视图光学和几何自适应细化","authors":"Fu-Quan Zhang , Kai-Hong Chen , Tsu-Yang Wu , Yang Hong , Jia-Jun Zhu , Chao Chen , Lin-Juan Ma , Jia-Xin Xu","doi":"10.1016/j.inffus.2025.103763","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations–CNNs struggle with capturing long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. Fortunately, recent advancements in state space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103763"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MOGAR: Multi-view optical and geometry adaptive refinement for high-fidelity 3D asset generation\",\"authors\":\"Fu-Quan Zhang , Kai-Hong Chen , Tsu-Yang Wu , Yang Hong , Jia-Jun Zhu , Chao Chen , Lin-Juan Ma , Jia-Xin Xu\",\"doi\":\"10.1016/j.inffus.2025.103763\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations–CNNs struggle with capturing long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. 
Fortunately, recent advancements in state space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103763\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008255\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008255","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MOGAR: Multi-view optical and geometry adaptive refinement for high-fidelity 3D asset generation
Multi-modal 3D asset generation from sparse-view inputs remains a core challenge in both computer vision and graphics due to the inherent difficulties in modelling multi-view geometric consistency and recovering high-frequency appearance details. While Convolutional Neural Networks (CNNs) and Transformers have demonstrated impressive capabilities in 3D generation, they suffer from significant limitations: CNNs struggle to capture long-range dependencies and global multi-view coherence, whereas Transformers incur quadratic computational complexity and often yield view-inconsistent or structurally ambiguous outputs. Fortunately, recent advances in state-space models, particularly the Mamba architecture, have shown remarkable potential by combining long-range dependency modelling with linear computational efficiency. However, the original Mamba is inherently constrained to unidirectional causal sequence modelling, making it suboptimal for high-dimensional visual scenarios. To address this, we propose MOGAR (Multi-View Optical and Geometry Adaptive Refinement), a novel and efficient multi-modal framework for 3D asset generation. MOGAR introduces the Multi-view Guided Selective Mamba (MvGSM) module as its core, enabling cross-directional and cross-scale alignment and integration of geometric and optical features. By synergistically combining feed-forward coarse asset generation, multi-view structural optimisation, optical attribute prediction, and cross-modal detail refinement via a UNet architecture, MOGAR achieves a tightly coupled reasoning pipeline from global structure to fine-grained details. We conduct extensive evaluations on several standard benchmarks, demonstrating that MOGAR consistently outperforms existing approaches in terms of geometric accuracy, rendering fidelity, and cross-view consistency, establishing a new paradigm for efficient and high-quality 3D asset generation under sparse input settings.
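To make the core idea concrete, the sketch below illustrates, in simplified form, why a single causal scan is limiting for multi-view tokens and how scanning the same token sequence in several orders (forward, backward, and across views) can give every token access to global context. This is a minimal, hypothetical illustration only: the module names (SimpleSelectiveScan, CrossDirectionalScan), the gated linear recurrence standing in for Mamba's selective state-space scan, and the fusion scheme are all assumptions for exposition, not the authors' MvGSM implementation.

```python
# Hypothetical sketch of cross-directional scanning over multi-view feature tokens.
# Not the MOGAR/MvGSM code; a simplified gated recurrence replaces the real
# Mamba selective scan to keep the example self-contained and runnable.
import torch
import torch.nn as nn


class SimpleSelectiveScan(nn.Module):
    """Causal scan with input-dependent gating (a stand-in for a selective scan)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # per-token, per-channel decay (the "selective" part)
        self.inp = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.gate(x))                   # decay in (0, 1)
        u = self.inp(x)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):                        # strictly causal: h sees only the past
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)


class CrossDirectionalScan(nn.Module):
    """Fuse forward, backward, and cross-view scan orders so that each token
    receives context from all directions, unlike a single causal pass."""

    def __init__(self, dim: int):
        super().__init__()
        self.scan = SimpleSelectiveScan(dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (batch, views, tokens_per_view, dim)
        b, v, n, d = tokens.shape
        seq = tokens.reshape(b, v * n, d)                      # view-major ordering
        fwd = self.scan(seq)
        bwd = self.scan(seq.flip(1)).flip(1)                   # reversed ordering
        # Token-major ordering: visit the same spatial token across all views first.
        cross = self.scan(tokens.transpose(1, 2).reshape(b, n * v, d))
        cross = cross.reshape(b, n, v, d).transpose(1, 2).reshape(b, v * n, d)
        out = self.fuse(torch.cat([fwd, bwd, cross], dim=-1))
        return out.reshape(b, v, n, d)


if __name__ == "__main__":
    x = torch.randn(2, 4, 16, 32)          # 2 samples, 4 views, 16 tokens per view, 32 channels
    y = CrossDirectionalScan(32)(x)
    print(y.shape)                          # torch.Size([2, 4, 16, 32])
```

The design point the sketch tries to capture is that the additional scan orders, not the recurrence itself, are what remove the unidirectional bias of the original Mamba while keeping the per-token cost linear in sequence length.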
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.