Learning Shape-Color Diffusion Priors for Text-Guided 3D Object Generation

Sheng-Yu Huang; Chi-Pin Huang; Kai-Po Chang; Zi-Ting Chou; I-Jieh Liu; Yu-Chiang Frank Wang

IEEE Transactions on Multimedia, vol. 27, pp. 3294-3306. Published 2025-01-27. DOI: 10.1109/TMM.2025.3535325
Abstract
Generating 3D shapes from textual input is a crucial topic in multimedia applications, with the potential to enhance VR/AR/XR by enabling more diverse virtual scenes. Owing to the recent success of diffusion models, text-guided 3D object generation has drawn considerable attention. However, current latent diffusion-based methods are restricted to shape-only generation, requiring time-consuming and computationally expensive post-processing to obtain colored objects. In this paper, we propose an end-to-end Shape-Color Diffusion Prior framework (SCDiff) to achieve colored text-to-3D object generation. Given a general text description as input, SCDiff distinguishes shape- and color-related priors in the text and generates a shape latent and a color latent, from which a pre-trained 3D object autoencoder derives colored 3D objects. SCDiff contains two 3D latent diffusion models (LDMs): one generates the shape latent from the input text, and the other generates the color latent. To help the two LDMs focus on shape- and color-related information, we further adopt a Large Language Model (LLM) to separate the input text into a shape phrase and a color phrase via in-context learning, so that neither LDM is influenced by irrelevant information. Because the shape and color latents are separated, we can manipulate the color of an object by giving different color phrases while maintaining the original shape. Experiments on a benchmark dataset quantitatively and qualitatively verify the effectiveness and practicality of the proposed model. As an extension, we show that SCDiff supports 3D object generation and manipulation conditioned on various modalities, further confirming the scalability and multimedia applications of the proposed framework.
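The pipeline described in the abstract lends itself to a compact sketch. Below is a minimal, hypothetical Python/PyTorch illustration of the SCDiff inference flow, not the authors' released code: `split_prompt`, `TinyLDM`, `embed`, and all dimensions are invented stand-ins (the real system uses an LLM with in-context examples for phrase separation, proper diffusion sampling, and a pre-trained 3D object autoencoder for decoding).

```python
# Hypothetical sketch of the SCDiff inference flow from the abstract.
# All names and architectural details below are illustrative stand-ins.
import torch
import torch.nn as nn


def split_prompt(text: str) -> tuple[str, str]:
    """Stand-in for the LLM-based in-context phrase separation.

    The paper prompts an LLM to split a caption into a shape phrase and a
    color phrase; here a naive keyword heuristic fakes that step.
    """
    color_words = {"red", "blue", "green", "yellow", "black", "white", "brown"}
    tokens = text.lower().split()
    color_phrase = " ".join(t for t in tokens if t in color_words) or "neutral"
    shape_phrase = " ".join(t for t in tokens if t not in color_words)
    return shape_phrase, color_phrase


class TinyLDM(nn.Module):
    """Toy denoiser standing in for one of the two 3D latent diffusion models."""

    def __init__(self, latent_dim: int = 64, text_dim: int = 32):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 128), nn.SiLU(),
            nn.Linear(128, latent_dim),
        )

    @torch.no_grad()
    def sample(self, text_emb: torch.Tensor, steps: int = 10) -> torch.Tensor:
        z = torch.randn(1, self.latent_dim)  # start from Gaussian noise
        for _ in range(steps):  # crude iterative refinement, not real DDPM sampling
            z = z - 0.1 * self.net(torch.cat([z, text_emb], dim=-1))
        return z


def embed(text: str, dim: int = 32) -> torch.Tensor:
    """Deterministic toy text embedding (a real system would use e.g. CLIP)."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(1, dim, generator=g)


# Two separate LDMs: one conditioned on the shape phrase, one on the color phrase.
shape_ldm, color_ldm = TinyLDM(), TinyLDM()
shape_phrase, color_phrase = split_prompt("a red wooden chair")
z_shape = shape_ldm.sample(embed(shape_phrase))
z_color = color_ldm.sample(embed(color_phrase))
print(z_shape.shape, z_color.shape)  # a pre-trained 3D autoencoder would decode these

# Color manipulation: reuse z_shape and only resample the color latent.
z_color_edit = color_ldm.sample(embed("blue"))
```

The design point the sketch highlights is the latent separation: a color edit only re-runs the color LDM while `z_shape` is reused unchanged, which is what makes shape-preserving recoloring cheap in this framework.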
About the Journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.