文本到三维端到端生成中语义对齐的超球面最优传输。

IEEE transactions on visualization and computer graphics Pub Date : 2025-07-07 DOI:10.1109/TVCG.2025.3586646

Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu

{"title":"文本到三维端到端生成中语义对齐的超球面最优传输。","authors":"Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu","doi":"10.1109/TVCG.2025.3586646","DOIUrl":null,"url":null,"abstract":"Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hyper-spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-end Generation.\",\"authors\":\"Zezeng Li, Weimin Wang, Yuming Zhao, Wenhai Li, Na Lei, Xianfeng Gu\",\"doi\":\"10.1109/TVCG.2025.3586646\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.\",\"PeriodicalId\":94035,\"journal\":{\"name\":\"IEEE transactions on visualization and computer graphics\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on visualization and computer graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TVCG.2025.3586646\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3586646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

最近的clip引导的3D生成方法已经取得了很好的结果，但是由于文本和图像嵌入之间的差距，难以生成符合输入文本的忠实3D形状。为此，本文提出了HOTS3D，它首次尝试通过球面最优传输（SOT）将文本特征与图像特征对齐来有效地弥合这一差距。然而，在高维情况下，解决SOT仍然是一个挑战。为了获得两个模态的CLIP编码得到的高维特征的SOT映射，我们基于Villani定理进行了数学表述和推导，该方法可以直接对准两个超球分布，而不需要流形指数映射。此外，我们通过利用输入凸神经网络（icnn）来实现最优的Kantorovich势。利用最优映射的特征，利用基于扩散的生成器将其解码为3D形状。与最先进的方法进行了广泛的定量和定性比较，证明了HOTS3D在文本到3d生成方面的优势，特别是在与文本语义的一致性方面。代码将是公开的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Hyper-spherical Optimal Transport for Semantic Alignment in Text-to-3D End-to-end Generation.

Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator is utilized to decode them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of HOTS3D for text-to-3D generation, especially in the consistency with text semantics. The code will be publicly available.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on visualization and computer graphics

自引率

0.00%

发文量