Diff-3DCap: Shape Captioning With Diffusion Models.

IEEE Transactions on Visualization and Computer Graphics. Pub Date: 2025-04-28. DOI: 10.1109/TVCG.2025.3564664

Zhenyu Shu, Jiawei Wen, Shiyang Li, Shiqing Xin, Ligang Liu


Citations: 0

Abstract

3D shape captioning is an important task in computer graphics and has attracted considerable interest in recent years. Traditional approaches often rely on costly voxel representations or object detection techniques, yet frequently fail to deliver satisfactory results. To address these challenges, we introduce Diff-3DCap, which represents a 3D object as a sequence of projected views and uses a continuous diffusion model to perform the captioning. More precisely, during the forward phase the continuous diffusion model perturbs the embedded captions with Gaussian noise, and during the reverse phase it predicts the reconstructed annotation. The diffusion framework incorporates a visual embedding obtained from a pre-trained visual-language model; this embedding naturally serves as a guiding signal, eliminating the need for an additional classifier. Extensive experiments indicate that Diff-3DCap achieves performance comparable to that of current state-of-the-art methods.
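
The forward phase described in the abstract, in which Gaussian noise perturbs the embedded captions, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear noise schedule, the number of steps, the embedding shape, and the function names (`linear_alpha_bar`, `forward_noise`, `recover_x0`) are all assumptions chosen for illustration, following the standard formulation of Gaussian diffusion. The denoising network conditioned on the visual embedding is not shown; the true noise stands in for its prediction.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) and return the noise used."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

def recover_x0(xt, t, alpha_bar, predicted_noise):
    """Invert the forward step given a noise estimate (here supplied by hand, not by a trained model)."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)            # hypothetical: 1000 diffusion steps
caption_emb = rng.standard_normal((16, 64))   # hypothetical: 16 tokens, 64-dim embeddings
xt, noise = forward_noise(caption_emb, 500, alpha_bar, rng)

# With a perfect noise estimate, the clean caption embedding is recovered exactly.
x0_hat = recover_x0(xt, 500, alpha_bar, noise)
```

In a trained system, a network would replace the hand-supplied noise estimate, taking the noisy caption embedding, the timestep, and the visual embedding of the projected views as inputs.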

