IEEE Transactions on Multimedia: Latest Publications

Latent Watermark: Inject and Detect Watermarks in Latent Diffusion Space
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535300
Zheling Meng; Bo Peng; Jing Dong
Abstract: Watermarking is a tool for actively identifying and attributing the images generated by latent diffusion models. Existing methods face a dilemma between image quality and watermark robustness. Watermarks with superior image quality usually have inferior robustness against attacks such as blurring and JPEG compression, while watermarks with superior robustness usually damage image quality significantly. This dilemma stems from the traditional paradigm in which watermarks are injected and detected in pixel space, relying on pixel perturbation for watermark detection and resilience against attacks. In this paper, we highlight that an effective solution is to both inject and detect watermarks in the latent diffusion space, and we propose Latent Watermark (LW) with a progressive training strategy. It weakens the direct connection between quality and robustness and thus alleviates their contradiction. We conduct evaluations on two datasets and against 10 watermark attacks, with six metrics measuring image quality and watermark robustness. Results show that, compared to recently proposed methods such as StableSignature, StegaStamp, RoSteALS, LaWa, TreeRing, and DiffuseTrace, LW not only surpasses them in robustness but also offers superior image quality.
Vol. 27, pp. 3399-3410
Citations: 0
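The key move in the abstract is to inject and detect the watermark message in the diffusion latent rather than in pixel space. Below is a minimal sketch of that interface, assuming Stable-Diffusion-sized latents (4x64x64) and a 48-bit message; the MLP injector/detector and all shapes are illustrative assumptions, not the paper's architecture or training strategy.

```python
# Illustrative latent-space watermark injector/detector (assumed shapes, not the paper's code).
import torch
import torch.nn as nn

class LatentWatermarker(nn.Module):
    def __init__(self, latent_dim=4 * 64 * 64, msg_bits=48):
        super().__init__()
        # Injector perturbs the latent conditioned on the message bits.
        self.inject = nn.Sequential(
            nn.Linear(latent_dim + msg_bits, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim))
        # Detector predicts the message bits from the (possibly attacked) latent.
        self.detect = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, msg_bits))

    def forward(self, z, msg):
        z_flat = z.flatten(1)
        z_marked = z_flat + self.inject(torch.cat([z_flat, msg], dim=1))
        logits = self.detect(z_marked)
        return z_marked.view_as(z), logits

z = torch.randn(2, 4, 64, 64)               # latent produced by the diffusion model
msg = torch.randint(0, 2, (2, 48)).float()  # watermark message bits
z_w, logits = LatentWatermarker()(z, msg)
bit_acc = ((logits > 0).float() == msg).float().mean()
print(z_w.shape, bit_acc.item())
```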
MISF-Net: Modality-Invariant and -Specific Fusion Network for RGB-T Crowd Counting
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535330
Baoyang Mu; Feng Shao; Zhengxuan Xie; Hangwei Chen; Zhongjie Zhu; Qiuping Jiang
Abstract: To perform crowd counting accurately, exploiting the complementary relationship between RGB and thermal images has become a focus of current research. Due to different imaging principles, multi-modal images often contain different contents, which constitute their modality-specific information: RGB images contain more texture and color details, while thermal images contain thermal radiation information. Meanwhile, they also describe the same target content, e.g., crowds, which is modality-invariant. However, existing methods only design different modules to directly fuse RGB and thermal image features, which does not fully account for these facts. In this paper, by analyzing the similarities and differences between multi-modal images, we propose a Modality-Invariant and -Specific Fusion Network (MISF-Net) for RGB-T crowd counting. Specifically, we design a modality decomposition and fusion module (MDFM), which decomposes RGB and thermal image features into modality-invariant and -specific features by using similarity and difference supervision between multi-modal features. Reconstruction supervision is also used to prevent the network from learning a biased decomposition. After that, different fusion strategies are applied to the invariant and specific features, respectively. In addition, to adapt to the size variations of different pedestrians, we design a modality-invariant fusion module (MIFM). Finally, after the fusion decoder, MISF-Net obtains a more accurate crowd density map. Comprehensive experiments on the RGB-T crowd counting dataset show that MISF-Net achieves competitive performance.
Vol. 27, pp. 2593-2607
Citations: 0
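The decomposition described in the abstract relies on three supervision signals: similarity between the modality-invariant features of the two modalities, dissimilarity between their modality-specific features, and reconstruction of the original features from the decomposed parts. A minimal sketch of such losses follows; the cosine/L1 forms and equal weighting are assumptions, not MISF-Net's exact formulation.

```python
# Sketch of invariant/specific decomposition losses (illustrative forms and weights).
import torch
import torch.nn.functional as F

def decomposition_losses(inv_rgb, inv_t, spec_rgb, spec_t,
                         feat_rgb, feat_t, recon_rgb, recon_t):
    # Invariant features from RGB and thermal should agree (similarity supervision).
    l_sim = 1.0 - F.cosine_similarity(inv_rgb, inv_t, dim=1).mean()
    # Specific features should differ across modalities (difference supervision).
    l_diff = F.cosine_similarity(spec_rgb, spec_t, dim=1).abs().mean()
    # Recomposed features should match the originals (reconstruction supervision).
    l_rec = F.l1_loss(recon_rgb, feat_rgb) + F.l1_loss(recon_t, feat_t)
    return l_sim + l_diff + l_rec

b, c = 4, 128
tensors = [torch.randn(b, c) for _ in range(8)]
print(decomposition_losses(*tensors).item())
```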
Relighting From a Single Image: Datasets and Deep Intrinsic-Based Architecture
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535397
Yixiong Yang; Hassan Ahmed Sial; Ramon Baldrich; Maria Vanrell
Abstract: Single-image scene relighting aims to generate a realistic new version of an input image so that it appears to be illuminated by a new target light condition. Although existing works have explored this problem from various perspectives, generating relit images under arbitrary light conditions remains highly challenging, and related datasets are scarce. Our work addresses this problem from both the dataset and methodological perspectives. We propose two new datasets: a synthetic dataset with ground-truth intrinsic components and a real dataset collected under laboratory conditions. These datasets alleviate the scarcity of existing data. To incorporate physical consistency into the relighting pipeline, we establish a two-stage network based on intrinsic decomposition, producing outputs at intermediate steps and thereby introducing physical constraints. When the training set lacks ground truth for intrinsic decomposition, we introduce an unsupervised module to ensure that the intrinsic outputs are satisfactory. Our method outperforms state-of-the-art methods on both existing datasets and our newly developed ones. Furthermore, pretraining our method or other prior methods on our synthetic dataset can enhance their performance on other datasets. Since our method can accommodate arbitrary light conditions, it is capable of producing animated results.
Vol. 27, pp. 2608-2622
Citations: 0
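The physical constraint behind the intermediate intrinsic outputs is the classic model I = R * S (reflectance times shading): relighting keeps the reflectance and swaps in the shading of the target light. The toy example below only illustrates this constraint with known shading maps; in the paper both stages are learned networks.

```python
# Toy intrinsic-based relighting: keep reflectance, replace shading (illustrative only).
import numpy as np

def relight(image, shading_old, shading_new, eps=1e-6):
    reflectance = image / (shading_old + eps)        # I = R * S  =>  R = I / S
    return np.clip(reflectance * shading_new, 0.0, 1.0)

img = np.random.rand(64, 64, 3)
s_old = 0.2 + 0.8 * np.random.rand(64, 64, 1)        # shading under the source light
s_new = 0.2 + 0.8 * np.random.rand(64, 64, 1)        # shading under the target light
print(relight(img, s_old, s_new).shape)              # (64, 64, 3)
```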
Universal Infrared Image Nonuniformity Correction via Stripe-Aware Attention Network
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535366
Kangle Wu; Jun Huang; Yong Ma; Fan Fan; Jiayi Ma
Abstract: Infrared image nonuniformity correction aims to remove column-wise stripe noise. Most existing methods consider only stripe noise and fail to handle real captured nonuniformity, in which the directional characteristic of stripes is severely disrupted by random Gaussian noise. Moreover, recent deep learning-based methods are limited by restricted receptive fields and thus cannot accurately distinguish vertical structures from vertical stripes. To address these issues, we propose a universal infrared image nonuniformity correction method based on a stripe-aware attention network. We improve performance by first restoring the damaged stripe directional characteristics and then maximizing the utilization of this prior. On the one hand, we construct a two-stage framework in which a denoising network is first applied to eliminate Gaussian noise while preserving stripes as scene information. As a result, the prior directional characteristics are restored, enhancing the ability of the subsequent sub-network to perceive stripe noise. On the other hand, because vertical structures and vertical textures exhibit distinct long-range pixel correlations, we introduce a column-wise stripe attention mechanism (CSA) that captures long-range dependencies of target pixels in the vertical direction. This significantly improves the algorithm's ability to discriminate between vertical structures and stripes, with minimal computational cost. Extensive experiments show that the proposed method achieves promising results and generalizes better across different infrared scenarios.
Vol. 27, pp. 3383-3398
Citations: 0
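The column-wise attention described in the abstract attends along the height axis of each column, so a pixel can relate to distant pixels in the same column; that long-range view is what separates genuine vertical scene structures from stripe noise. A minimal sketch follows, with assumed channel counts and a stock multi-head attention standing in for the paper's CSA module.

```python
# Sketch of column-wise attention: each column becomes a sequence along the height axis,
# so attention is computed only in the vertical direction (assumed shapes, not the paper's CSA).
import torch
import torch.nn as nn

class ColumnAttention(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Fold each column into its own sequence of length H.
        seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        out, _ = self.attn(seq, seq, seq)      # attend along the vertical axis only
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)

x = torch.randn(1, 32, 64, 48)
print(ColumnAttention(32)(x).shape)            # torch.Size([1, 32, 64, 48])
```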
Asymptotics-Aware Multi-View Subspace Clustering
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535402
Yesong Xu; Shuo Chen; Jun Li; Jian Yang
Abstract: Multi-view subspace clustering has recently attracted extensive attention due to the rapid increase of multi-view data in many real-world applications. The main goal of this task is to learn a common representation of multiple subspaces from the given multi-view data, and most existing methods directly merge multiple groups of features through single-step integration. However, there may exist large disparities among different views of the data, so the conventional single-step practice can hardly obtain a generally consistent feature representation for multi-view data. To overcome this challenge, we present a novel approach dubbed "Asymptotics-Aware Multi-view Subspace Clustering (A²MSC)" that pursues a consistent feature representation in a multi-step way, iteratively conducting data recovery to gradually reduce the differences between pairwise views. Specifically, we construct an asymptotic learning rule to update the feature representation, and the iteration result converges to a consistent feature vector characterizing each instance of the original multi-view data. After that, we use this new feature representation to learn a clustering-oriented similarity matrix by minimizing a self-expressive objective, and we design the corresponding optimization algorithm to solve it with convergence guarantees. Theoretically, we prove that the learned asymptotic representation effectively integrates multiple views, thereby ensuring the effective handling of multi-view data. Empirically, extensive experimental results demonstrate the superiority of our proposed A²MSC over state-of-the-art multi-view subspace clustering approaches.
Vol. 27, pp. 3650-3663
Citations: 0
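Two steps from the abstract can be sketched concretely: an iterative update that pulls the per-view representations toward a common one, and a self-expressive objective min_C ||X - XC||_F^2 + lam ||C||_F^2 whose solution yields a similarity matrix for clustering. The simple averaging rule and ridge regularizer below are simplifying assumptions, not A²MSC's actual asymptotic rule or objective.

```python
# Sketch: (i) pull per-view features toward a consensus, (ii) self-expressive affinity.
import numpy as np

def consensus(views, steps=10, alpha=0.5):
    z = np.mean(views, axis=0)                      # initial common representation
    for _ in range(steps):
        views = [alpha * v + (1 - alpha) * z for v in views]  # shrink view gaps
        z = np.mean(views, axis=0)
    return z                                        # (n_samples, dim)

def self_expressive_affinity(x, lam=0.1):
    # Closed-form ridge solution of min_C ||X - X C||^2 + lam ||C||^2 (X: dim x n).
    g = x.T @ x
    c = np.linalg.solve(g + lam * np.eye(g.shape[0]), g)
    return np.abs(c) + np.abs(c.T)                  # symmetric similarity matrix

views = [np.random.rand(50, 20) for _ in range(3)]
z = consensus(views)
print(self_expressive_affinity(z.T).shape)          # (50, 50)
```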
Towards Depth-Continuous Scene Representation With a Displacement Field for Robust Light Field Depth Estimation
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535352
Rongshan Chen; Hao Sheng; Da Yang; Ruixuan Cong; Zhenglong Cui; Sizhe Wang; Tun Wang; Mingyuan Zhao
Abstract: Light field (LF) imaging captures both spatial and angular information of scenes, enabling accurate depth estimation. However, previous deep learning methods have typically modeled surface depth only, ignoring the continuous nature of depth in 3D scenes. In this paper, we use a displacement field (DF) to describe this continuous property and propose a novel depth-continuous scene representation for robust LF depth estimation. Experiments demonstrate that our representation enables the network to generate highly detailed depth maps with fewer parameters and faster speed. Specifically, inspired by the signed distance field used in 3D object description, we exploit the intrinsic depth-continuous property of 3D scenes using a DF and define a novel depth-continuous scene representation. We then introduce a simple yet general learning framework for depth-continuous scene embedding; the proposed network, DepthDF, achieves state-of-the-art performance on both synthetic and real-world LF datasets, ranking 1st on the HCI 4D Light Field benchmark. Furthermore, previous LF depth estimation methods can be seamlessly integrated into this framework. Finally, we extend the framework beyond LF depth estimation to various tasks, including multi-view stereo depth inference, LF super-resolution, and LF salient object detection. Experiments demonstrate improved performance when the continuous scene representation is applied, suggesting that our framework can bring insights to more fields.
Vol. 27, pp. 3637-3649
Citations: 0
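The abstract does not spell out how the displacement field enters the network, so the following is only one plausible reading, marked as such: a coarse surface depth is repeatedly corrected by a predicted displacement, yielding a representation that varies continuously rather than committing to a single hard surface value. It is an assumption for illustration, not the DepthDF architecture.

```python
# Hypothetical displacement-field refinement of a coarse depth map (illustrative only).
import torch
import torch.nn as nn

class DisplacementRefiner(nn.Module):
    def __init__(self, feat_ch=32):
        super().__init__()
        # Predicts a per-pixel displacement from image features and the current depth.
        self.df = nn.Conv2d(feat_ch + 1, 1, kernel_size=3, padding=1)

    def forward(self, feat, coarse_depth, steps=3):
        depth = coarse_depth
        for _ in range(steps):
            depth = depth + self.df(torch.cat([feat, depth], dim=1))
        return depth

feat = torch.randn(1, 32, 128, 128)
coarse = torch.rand(1, 1, 128, 128)
print(DisplacementRefiner()(feat, coarse).shape)     # torch.Size([1, 1, 128, 128])
```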
All-in-One Weather-Degraded Image Restoration Via Adaptive Degradation-Aware Self-Prompting Model
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535316
Yuanbo Wen; Tao Gao; Ziqi Li; Jing Zhang; Kaihao Zhang; Ting Chen
Abstract: Existing approaches to all-in-one weather-degraded image restoration leverage degradation-aware priors inefficiently, resulting in sub-optimal performance when adapting to different weather conditions. To this end, we develop an adaptive degradation-aware self-prompting model (ADSM) for all-in-one weather-degraded image restoration. Specifically, our model employs the contrastive language-image pre-training model (CLIP) to facilitate the training of the proposed latent prompt generators (LPGs), which produce three types of latent prompts characterizing the degradation type, degradation property, and image caption. We integrate the acquired degradation-aware prompts into the time embedding of the diffusion model to improve degradation perception. Meanwhile, we employ the latent caption prompt to guide the reverse sampling process through a cross-attention mechanism, thereby guiding accurate image reconstruction. Furthermore, to accelerate the reverse sampling procedure of the diffusion model and address the limitations of frequency perception, we introduce a wavelet-oriented noise estimating network (WNE-Net). Extensive experiments on eight publicly available datasets demonstrate the effectiveness of the proposed approach in both task-specific and all-in-one applications.
Vol. 27, pp. 3343-3355
Citations: 0
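The abstract states that the degradation-aware prompts are integrated into the diffusion model's time embedding. A minimal sketch of that fusion is below; the projection layer, dimensions, and simple additive combination are assumptions rather than ADSM's exact design.

```python
# Sketch of folding a degradation-aware prompt into the diffusion timestep embedding.
import torch
import torch.nn as nn

class PromptedTimeEmbedding(nn.Module):
    def __init__(self, time_dim=256, prompt_dim=512):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(time_dim, time_dim), nn.SiLU(),
                                      nn.Linear(time_dim, time_dim))
        self.prompt_proj = nn.Linear(prompt_dim, time_dim)

    def forward(self, t_emb, prompt_emb):
        # Degradation type/property prompts bias every timestep embedding.
        return self.time_mlp(t_emb) + self.prompt_proj(prompt_emb)

t_emb = torch.randn(4, 256)        # sinusoidal timestep embedding (assumed dim)
prompt = torch.randn(4, 512)       # CLIP-derived degradation prompt (assumed dim)
print(PromptedTimeEmbedding()(t_emb, prompt).shape)   # torch.Size([4, 256])
```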
JPEG Reversible Data Hiding via Block Sorting Optimization and Dynamic Iterative Histogram Modification
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535320
Fengyong Li; Qiankuan Wang; Hang Cheng; Xinpeng Zhang; Chuan Qin
Abstract: JPEG reversible data hiding (RDH) is a covert communication technology that accurately extracts secret data while perfectly recovering the original JPEG image. With the development of cloud services, large numbers of private JPEG images can be efficiently managed on cloud platforms by embedding user IDs or authentication labels. Nevertheless, data embedding operations may inadvertently disrupt the encoding sequence of the original JPEG image, resulting in severe distortion of the host image when it is re-compressed to JPEG format. To address this problem, this paper proposes a new JPEG RDH scheme based on block sorting optimization and dynamic iterative histogram modification. We first design a block ordering optimization strategy that combines the number of zero coefficients and the quantization table values of non-zero coefficients in a DCT block. Subsequently, a dynamic iterative histogram modification scheme is proposed that considers the local features and embedding capacity of histograms generated from images with different textures. For a given payload, we introduce parameters to control the iterations of the two-dimensional histogram and adaptively generate the optimal histogram modification mapping, which keeps the JPEG file size increment low by leaving as many AC coefficients unchanged as possible. Extensive experiments show that our scheme achieves an effective balance among embedding capacity, visual quality, file size increment, and computational complexity, and outperforms the state of the art on these metrics.
Vol. 27, pp. 3729-3743
Citations: 0
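The block ordering idea can be illustrated with a simple score: 8x8 DCT blocks with more zero AC coefficients, and smaller quantization steps at the non-zero positions, are ranked first as embedding candidates because modifying them costs less distortion and file-size growth. The scoring weight below is a hypothetical choice, not the paper's exact criterion.

```python
# Toy ranking of DCT blocks for embedding (illustrative score, not the paper's criterion).
import numpy as np

def block_score(block, qtable, w=0.1):
    ac = block.flatten()[1:]                      # skip the DC coefficient
    q = qtable.flatten()[1:]
    nonzero = ac != 0
    # More zero AC coefficients is better; large quantization steps on non-zeros are penalized.
    return np.sum(~nonzero) - w * np.sum(q[nonzero])

def sort_blocks(blocks, qtable):
    scores = [block_score(b, qtable) for b in blocks]
    return np.argsort(scores)[::-1]               # best embedding candidates first

blocks = [np.random.randint(-3, 4, (8, 8)) for _ in range(6)]
qtable = np.random.randint(1, 50, (8, 8))
print(sort_blocks(blocks, qtable))
```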
Text2Avatar: Articulated 3D Avatar Creation With Text Instructions
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535293
Yong-Hoon Kwon; Ju Hong Yoon; Min-Gyu Park
Abstract: We propose a framework for creating articulated human avatars, editing their styles, and animating them from three types of text instructions. The three instruction types, identity, edit, and action, are fed into three models that generate, edit, and animate human avatars, respectively. Specifically, the framework takes an identity instruction and multi-view pose condition images to generate images of a human using the avatar generation model. The avatar can then be edited with text instructions by changing the style of the generated images. We apply a Neural Radiance Field (NeRF) and Poisson reconstruction to extract a human mesh model from the images and assign linear blend skinning (LBS) weights to its vertices. Finally, action instructions animate the avatar, where we use an off-the-shelf method to generate motions from text instructions. Notably, our method adapts the appearance of hundreds of different individuals to construct a conditionally editable avatar generation model, allowing easy creation of 3D avatars from text instructions. We demonstrate high-fidelity, animatable 3D avatar creation with text instructions on various datasets and highlight the superior performance of the proposed method compared to previous studies.
Vol. 27, pp. 3797-3806
Citations: 0
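Linear blend skinning, which the pipeline uses to rig the extracted mesh, deforms each vertex by a weighted sum of joint transforms, v'_i = sum_j w_ij (T_j v_i). The snippet below is the standard LBS formula and is independent of the paper's specific rigging code.

```python
# Standard linear blend skinning over homogeneous vertices.
import numpy as np

def lbs(vertices, weights, transforms):
    # vertices: (N, 3), weights: (N, J), transforms: (J, 4, 4)
    v_h = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (N, 4)
    per_joint = np.einsum('jab,nb->nja', transforms, v_h)                  # T_j applied to each vertex
    blended = np.einsum('nj,nja->na', weights, per_joint)                  # weighted sum over joints
    return blended[:, :3]

verts = np.random.rand(100, 3)
w = np.random.dirichlet(np.ones(4), size=100)       # skinning weights sum to 1 per vertex
T = np.stack([np.eye(4)] * 4)                       # identity joint transforms
print(np.allclose(lbs(verts, w, T), verts))          # True: identity transforms leave the mesh unchanged
```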
Learning Shape-Color Diffusion Priors for Text-Guided 3D Object Generation
IF 8.4, CAS Tier 1, Computer Science
IEEE Transactions on Multimedia Pub Date: 2025-01-27 DOI: 10.1109/TMM.2025.3535325
Sheng-Yu Huang; Chi-Pin Huang; Kai-Po Chang; Zi-Ting Chou; I-Jieh Liu; Yu-Chiang Frank Wang
Abstract: Generating 3D shapes from specific textual input is a crucial topic in multimedia applications, with the potential to enhance VR/AR/XR by enabling more diverse virtual scenes. Owing to the recent success of diffusion models, text-guided 3D object generation has drawn considerable attention. However, current latent diffusion-based methods are restricted to shape-only generation, requiring time-consuming and computationally expensive post-processing to obtain colored objects. In this paper, we propose an end-to-end Shape-Color Diffusion Prior framework (SCDiff) for colored text-to-3D object generation. Given a general text description as input, SCDiff distinguishes shape- and color-related priors in the text and generates a shape latent and a color latent for a pre-trained 3D object auto-encoder to derive colored 3D objects. SCDiff contains two 3D latent diffusion models (LDMs): one generates the shape latent from the input text and the other generates the color latent. To help the two LDMs focus on shape/color-related information, we adopt a Large Language Model (LLM) to separate the input text into a shape phrase and a color phrase via an in-context learning technique, so that the shape/color LDMs are not influenced by irrelevant information. Due to the separation of shape and color latents, we can manipulate the color of an object by giving different color phrases while maintaining the original shape. Experiments on a benchmark dataset quantitatively and qualitatively verify the effectiveness and practicality of the proposed model. As an extension, we show the capability of SCDiff for 3D object generation and manipulation based on various modality conditions, which further confirms the scalability and multimedia applications of the proposed framework.
Vol. 27, pp. 3294-3306
Citations: 0
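SCDiff first splits the caption into a shape phrase and a color phrase (done in the paper with an LLM and in-context examples) so that each latent diffusion model sees only the relevant part. The stand-in below uses a toy color-word list purely to make that interface concrete and self-contained; it is not the LLM-based separation.

```python
# Toy stand-in for the LLM-based caption split into shape and color phrases.
COLOR_WORDS = {"red", "green", "blue", "yellow", "black", "white", "purple", "brown"}

def split_caption(caption):
    shape_tokens, color_tokens = [], []
    for tok in caption.lower().split():
        (color_tokens if tok.strip(",.") in COLOR_WORDS else shape_tokens).append(tok)
    # shape phrase conditions the shape LDM, color phrase conditions the color LDM
    return " ".join(shape_tokens), " ".join(color_tokens)

shape_phrase, color_phrase = split_caption("a tall red mug with a round handle")
print(shape_phrase)   # "a tall mug with a round handle"
print(color_phrase)   # "red"
```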