{"title":"MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model","authors":"Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, Hongzhi Wu, Hao Su","doi":"arxiv-2408.10198","DOIUrl":null,"url":null,"abstract":"Open-world 3D reconstruction models have recently garnered significant\nattention. However, without sufficient 3D inductive bias, existing methods\ntypically entail expensive training costs and struggle to extract high-quality\n3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction\nmodel that explicitly leverages 3D native structure, input guidance, and\ntraining supervision. Specifically, instead of using a triplane representation,\nwe store features in 3D sparse voxels and combine transformers with 3D\nconvolutions to leverage an explicit 3D structure and projective bias. In\naddition to sparse-view RGB input, we require the network to take input and\ngenerate corresponding normal maps. The input normal maps can be predicted by\n2D diffusion models, significantly aiding in the guidance and refinement of the\ngeometry's learning. Moreover, by combining Signed Distance Function (SDF)\nsupervision with surface rendering, we directly learn to generate high-quality\nmeshes without the need for complex multi-stage training processes. By\nincorporating these explicit 3D biases, MeshFormer can be trained efficiently\nand deliver high-quality textured meshes with fine-grained geometric details.\nIt can also be integrated with 2D diffusion models to enable fast\nsingle-image-to-3D and text-to-3D tasks. Project page:\nhttps://meshformer3d.github.io","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"29 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.10198","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. Specifically, instead of using a triplane representation, we store features in 3D sparse voxels and combine transformers with 3D convolutions to leverage an explicit 3D structure and projective bias. In addition to sparse-view RGB input, we require the network to take as input and generate corresponding normal maps. The input normal maps can be predicted by 2D diffusion models and provide significant guidance for learning and refining the geometry. Moreover, by combining Signed Distance Function (SDF) supervision with surface rendering, we directly learn to generate high-quality meshes without the need for complex multi-stage training processes. By incorporating these explicit 3D biases, MeshFormer can be trained efficiently and deliver high-quality textured meshes with fine-grained geometric details. It can also be integrated with 2D diffusion models to enable fast single-image-to-3D and text-to-3D tasks. Project page: https://meshformer3d.github.io
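
To make the architectural claim concrete, below is a minimal sketch (not the authors' code) of the idea of storing features on a 3D voxel grid and interleaving 3D convolutions with attention. For simplicity it uses a dense grid, whereas the paper describes sparse voxels (which would typically rely on a sparse-convolution backend); all module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VoxelTransformerBlock(nn.Module):
    """One hybrid block: a 3D convolution (local, explicit 3D structure)
    followed by self-attention over voxel tokens (global context)."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(dim, dim, kernel_size=3, padding=1),
            nn.GroupNorm(8, dim),
            nn.SiLU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) voxel feature grid
        x = x + self.conv(x)                   # local 3D convolutional update
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, D*H*W, C) voxel tokens
        tokens = self.norm(tokens)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return x + attn_out.transpose(1, 2).reshape(b, c, d, h, w)


if __name__ == "__main__":
    block = VoxelTransformerBlock(dim=64)
    voxels = torch.randn(1, 64, 16, 16, 16)    # toy 16^3 feature grid
    print(block(voxels).shape)                 # torch.Size([1, 64, 16, 16, 16])
```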
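
The training objective described in the abstract combines direct SDF supervision with surface-rendering losses in a single stage. The sketch below is an assumption about how such a combined loss could look, not the paper's implementation: the rendered images and normal maps are assumed to come from a differentiable surface renderer (e.g. by tracing the predicted SDF), and the loss weights are illustrative.

```python
import torch
import torch.nn.functional as F


def combined_sdf_rendering_loss(
    pred_sdf: torch.Tensor,         # (N,) predicted SDF at sampled 3D points
    gt_sdf: torch.Tensor,           # (N,) ground-truth SDF at the same points
    rendered_rgb: torch.Tensor,     # (V, 3, H, W) surface-rendered color images
    gt_rgb: torch.Tensor,           # (V, 3, H, W) target color images
    rendered_normals: torch.Tensor, # (V, 3, H, W) surface-rendered normal maps
    gt_normals: torch.Tensor,       # (V, 3, H, W) target normal maps
    w_sdf: float = 1.0,
    w_rgb: float = 1.0,
    w_normal: float = 1.0,
) -> torch.Tensor:
    """Single-stage objective: explicit SDF supervision plus rendering terms."""
    sdf_loss = F.l1_loss(pred_sdf, gt_sdf)                  # direct 3D supervision
    rgb_loss = F.mse_loss(rendered_rgb, gt_rgb)             # texture via rendering
    normal_loss = F.mse_loss(rendered_normals, gt_normals)  # geometry via rendering
    return w_sdf * sdf_loss + w_rgb * rgb_loss + w_normal * normal_loss


if __name__ == "__main__":
    loss = combined_sdf_rendering_loss(
        torch.randn(1024), torch.randn(1024),
        torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64),
        torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64),
    )
    print(loss.item())
```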