文本到三维端到端生成中语义对齐的超球面最优传输。

Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu
{"title":"文本到三维端到端生成中语义对齐的超球面最优传输。","authors":"Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu","doi":"10.1109/TVCG.2025.3586646","DOIUrl":null,"url":null,"abstract":"<p><p>Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hyper-spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-end Generation.\",\"authors\":\"Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu\",\"doi\":\"10.1109/TVCG.2025.3586646\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.</p>\",\"PeriodicalId\":94035,\"journal\":{\"name\":\"IEEE transactions on visualization and computer graphics\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on visualization and computer graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TVCG.2025.3586646\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3586646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

最近的clip引导的3D生成方法已经取得了很好的结果,但是由于文本和图像嵌入之间的差距,难以生成符合输入文本的忠实3D形状。为此,本文提出了HOTS3D,它首次尝试通过球面最优传输(SOT)将文本特征与图像特征对齐来有效地弥合这一差距。然而,在高维情况下,解决SOT仍然是一个挑战。为了获得两个模态的CLIP编码得到的高维特征的SOT映射,我们基于Villani定理进行了数学表述和推导,该方法可以直接对准两个超球分布,而不需要流形指数映射。此外,我们通过利用输入凸神经网络(icnn)来实现最优的Kantorovich势。利用最优映射的特征,利用基于扩散的生成器将其解码为3D形状。与最先进的方法进行了广泛的定量和定性比较,证明了HOTS3D在文本到3d生成方面的优势,特别是在与文本语义的一致性方面。代码将是公开的。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Hyper-spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-end Generation.

Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信