{"title":"Text-to-3D scene generation framework: bridging textual descriptions to high-fidelity 3D scenes.","authors":"Zuan Gu, Tianhan Gao, Huimin Liu","doi":"10.1186/s42492-025-00210-0","DOIUrl":null,"url":null,"abstract":"<p><p>Text-to-3D scene generation is pivotal for digital content creation; however, existing methods often struggle with global consistency across views. We present 3DS-Gen, a modular \"generate-then-reconstruct\" framework that first produces a temporally coherent multi-view video prior and then reconstructs consistent 3D scenes using sparse geometry estimation and Gaussian optimization. A cascaded variational autoencoder (2D for spatial compression and 3D for temporal compression) provides a compact and coherent latent sequence that facilitates robust reconstruction. An adaptive density threshold improves detailed allocation in the Gaussian stage under a fixed computational budget. While explicit meshes can be extracted from the optimized representation when needed, our claims emphasize multiview consistency and reconstructability; the mesh quality depends on the video prior and the chosen explicitification backend. 3DS-Gen runs on a single GPU and yields coherent scene reconstructions across diverse prompts, thereby providing a practical bridge between text and 3D content creation.</p>","PeriodicalId":29931,"journal":{"name":"Visual Computing for Industry Biomedicine and Art","volume":"8 1","pages":"29"},"PeriodicalIF":6.0000,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712286/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Visual Computing for Industry Biomedicine and Art","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s42492-025-00210-0","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Abstract
Text-to-3D scene generation is pivotal for digital content creation; however, existing methods often struggle with global consistency across views. We present 3DS-Gen, a modular "generate-then-reconstruct" framework that first produces a temporally coherent multi-view video prior and then reconstructs consistent 3D scenes via sparse geometry estimation and Gaussian optimization. A cascaded variational autoencoder (a 2D stage for spatial compression and a 3D stage for temporal compression) yields a compact, coherent latent sequence that facilitates robust reconstruction. An adaptive density threshold improves the allocation of detail in the Gaussian optimization stage under a fixed computational budget. Explicit meshes can be extracted from the optimized representation when needed, but our claims center on multi-view consistency and reconstructability; mesh quality depends on the video prior and on the chosen mesh-extraction backend. 3DS-Gen runs on a single GPU and yields coherent scene reconstructions across diverse prompts, providing a practical bridge between text and 3D content creation.
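The abstract does not specify how the adaptive density threshold operates, so the sketch below is one plausible reading rather than the authors' implementation. It assumes the densification signal standard in 3D Gaussian splatting (per-Gaussian view-space positional gradient magnitudes) and invents a simple scaling rule in which the function name `adaptive_densify`, the base threshold `base_tau`, and the `occupancy` term are all hypothetical: the threshold rises as the Gaussian count approaches the fixed budget, so new Gaussians go only where the reconstruction error signal is largest.

```python
import numpy as np

def adaptive_densify(grads: np.ndarray, num_gaussians: int, budget: int,
                     base_tau: float = 2e-4) -> np.ndarray:
    """Select Gaussians to split/clone under a fixed budget (hypothetical sketch).

    grads: per-Gaussian view-space positional gradient magnitudes.
    num_gaussians: current number of Gaussians in the scene.
    budget: hard cap on the total number of Gaussians.
    """
    # Invented scaling rule: as the budget fills, raise the threshold so
    # fewer candidates qualify and detail is allocated more selectively.
    occupancy = min(num_gaussians / budget, 1.0)
    tau = base_tau / max(1.0 - occupancy, 1e-3)

    # Candidates are Gaussians whose error signal exceeds the threshold.
    candidates = np.flatnonzero(grads > tau)

    # Never exceed the budget: keep only the highest-gradient candidates.
    room = max(budget - num_gaussians, 0)
    if candidates.size > room:
        order = np.argsort(grads[candidates])[::-1]
        candidates = candidates[order[:room]]
    return candidates

# Example: 10k Gaussians, a budget of 12k, synthetic gradient magnitudes.
rng = np.random.default_rng(0)
grads = rng.exponential(1e-4, size=10_000)
idx = adaptive_densify(grads, num_gaussians=10_000, budget=12_000)
print(f"{idx.size} Gaussians selected for split/clone")
```

The design intent this sketch illustrates is the trade-off the abstract names: a fixed threshold either over-densifies early (exhausting the budget on easy regions) or under-densifies late, whereas coupling the threshold to budget occupancy spends the remaining capacity on the regions with the strongest error signal.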