Learning Shape-Color Diffusion Priors for Text-Guided 3D Object Generation

Sheng-Yu Huang; Chi-Pin Huang; Kai-Po Chang; Zi-Ting Chou; I-Jieh Liu; Yu-Chiang Frank Wang

IEEE Transactions on Multimedia, vol. 27, pp. 3294-3306. Published 2025-01-27. DOI: 10.1109/TMM.2025.3535325
Abstract
Generating 3D shapes from textual input is a crucial topic in multimedia applications, with the potential to enhance VR/AR/XR by enabling more diverse virtual scenes. Owing to the recent success of diffusion models, text-guided 3D object generation has drawn considerable attention. However, current latent diffusion-based methods are restricted to shape-only generation, requiring time-consuming and computationally expensive post-processing to obtain colored objects. In this paper, we propose an end-to-end Shape-Color Diffusion Prior framework (SCDiff) to achieve colored text-to-3D object generation. Given a general text description as input, SCDiff distinguishes shape- and color-related priors in the text and generates a shape latent and a color latent, from which a pre-trained 3D object autoencoder derives colored 3D objects. SCDiff contains two 3D latent diffusion models (LDMs): one generates the shape latent from the input text, and the other generates the color latent. To help the two LDMs focus on shape- and color-related information, we further adopt a Large Language Model (LLM) to separate the input text into a shape phrase and a color phrase via in-context learning, so that neither LDM is influenced by irrelevant information. Because the shape and color latents are separated, we can manipulate the color of an object by giving different color phrases while maintaining the original shape. Experiments on a benchmark dataset quantitatively and qualitatively verify the effectiveness and practicality of the proposed model. As an extension, we show that SCDiff supports 3D object generation and manipulation conditioned on various modalities, further confirming the scalability and multimedia applications of the proposed framework.
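The pipeline described in the abstract lends itself to a compact sketch. Below is a minimal, hypothetical Python/PyTorch illustration of the SCDiff inference flow, not the authors' released code: `split_prompt`, `TinyLDM`, `embed`, and all dimensions are invented stand-ins (the real system uses an LLM with in-context examples for phrase separation, proper diffusion sampling, and a pre-trained 3D object autoencoder for decoding).

```python
# Hypothetical sketch of the SCDiff inference flow from the abstract.
# All names and architectural details below are illustrative stand-ins.
import torch
import torch.nn as nn


def split_prompt(text: str) -> tuple[str, str]:
    """Stand-in for the LLM-based in-context phrase separation.

    The paper prompts an LLM to split a caption into a shape phrase and a
    color phrase; here a naive keyword heuristic fakes that step.
    """
    color_words = {"red", "blue", "green", "yellow", "black", "white", "brown"}
    tokens = text.lower().split()
    color_phrase = " ".join(t for t in tokens if t in color_words) or "neutral"
    shape_phrase = " ".join(t for t in tokens if t not in color_words)
    return shape_phrase, color_phrase


class TinyLDM(nn.Module):
    """Toy denoiser standing in for one of the two 3D latent diffusion models."""

    def __init__(self, latent_dim: int = 64, text_dim: int = 32):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 128), nn.SiLU(),
            nn.Linear(128, latent_dim),
        )

    @torch.no_grad()
    def sample(self, text_emb: torch.Tensor, steps: int = 10) -> torch.Tensor:
        z = torch.randn(1, self.latent_dim)  # start from Gaussian noise
        for _ in range(steps):  # crude iterative refinement, not real DDPM sampling
            z = z - 0.1 * self.net(torch.cat([z, text_emb], dim=-1))
        return z


def embed(text: str, dim: int = 32) -> torch.Tensor:
    """Deterministic toy text embedding (a real system would use e.g. CLIP)."""
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return torch.randn(1, dim, generator=g)


# Two separate LDMs: one conditioned on the shape phrase, one on the color phrase.
shape_ldm, color_ldm = TinyLDM(), TinyLDM()
shape_phrase, color_phrase = split_prompt("a red wooden chair")
z_shape = shape_ldm.sample(embed(shape_phrase))
z_color = color_ldm.sample(embed(color_phrase))
print(z_shape.shape, z_color.shape)  # a pre-trained 3D autoencoder would decode these

# Color manipulation: reuse z_shape and only resample the color latent.
z_color_edit = color_ldm.sample(embed("blue"))
```

The design point the sketch highlights is the latent separation: a color edit only re-runs the color LDM while `z_shape` is reused unchanged, which is what makes shape-preserving recoloring cheap in this framework.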
About the Journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.