{"title":"CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis With Multimodal Diffusion.","authors":"Nisha Huang, Weiming Dong, Yuxin Zhang, Fan Tang, Ronghui Li, Chongyang Ma, Xiu Li, Tong-Yee Lee, Changsheng Xu","doi":"10.1109/TVCG.2025.3570771","DOIUrl":null,"url":null,"abstract":"<p><p>Although remarkable progress has been made in image style transfer, style is just one of the components of artistic paintings. Directly transferring extracted style features to natural images often results in outputs with obvious synthetic traces. This is because key painting attributes including layout, perspective, shape, and semantics often cannot be conveyed and expressed through style transfer. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize a vast amount of high-quality images. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it more complicated to achieve a unified aesthetic in artworks. Our main novel idea is to integrate multimodal semantic information as a synthesis guide into artworks, rather than transferring style to the real world. We also aim to reduce the disruption to the harmony of artworks while simplifying the guidance conditions. Specifically, we propose an innovative multi-task unified framework called CreativeSynth, based on the diffusion model with the ability to coordinate multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms to seamlessly integrate real-world semantic content into the art domain through Cross-Art-Attention for aesthetic maintenance and semantic fusion. We demonstrate the results of our method across a wide range of different art categories, proving that CreativeSynth bridges the gap between generative models and artistic expression. Code and results are available at https://github.com/haha-lisa/CreativeSynth.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3570771","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Although remarkable progress has been made in image style transfer, style is only one component of artistic paintings. Directly transferring extracted style features to natural images often yields outputs with obvious synthetic traces, because key painting attributes such as layout, perspective, shape, and semantics often cannot be conveyed through style transfer. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize vast amounts of high-quality images. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it harder to achieve a unified aesthetic in artworks. Our main novel idea is to integrate multimodal semantic information into artworks as a synthesis guide, rather than transferring style to the real world. We also aim to reduce disruption to the harmony of artworks while simplifying the guidance conditions. Specifically, we propose CreativeSynth, an innovative multi-task unified framework built on a diffusion model that coordinates multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms, using Cross-Art-Attention to seamlessly integrate real-world semantic content into the art domain for aesthetic maintenance and semantic fusion. We demonstrate the results of our method across a wide range of art categories, showing that CreativeSynth bridges the gap between generative models and artistic expression. Code and results are available at https://github.com/haha-lisa/CreativeSynth.
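The abstract's central mechanism is Cross-Art-Attention: queries drawn from the artwork attend both to the artwork's own features (aesthetic maintenance) and to features from the real-world semantic input (semantic fusion). The abstract does not spell out the layer, so the PyTorch sketch below is only a minimal illustration of that general idea; the function name, the shared projection matrices, and the blending weight `lam` are assumptions for illustration, not CreativeSynth's actual design.

```python
import torch
import torch.nn.functional as F

def cross_art_attention_sketch(art_tokens, semantic_tokens, w_q, w_k, w_v, lam=0.5):
    """Hypothetical cross-attention blending two feature streams.

    art_tokens:      (B, N_a, D) latent tokens of the artwork
    semantic_tokens: (B, N_s, D) tokens from the real-world semantic input
    w_q, w_k, w_v:   (D, D) projection matrices (assumed shared across streams)
    lam:             assumed blending weight between the two attention outputs
    """
    q = art_tokens @ w_q  # queries always come from the artwork stream
    k_art, v_art = art_tokens @ w_k, art_tokens @ w_v
    k_sem, v_sem = semantic_tokens @ w_k, semantic_tokens @ w_v

    d = q.shape[-1]
    # Attend to the art stream itself (preserving the aesthetic) ...
    attn_art = F.softmax(q @ k_art.transpose(-2, -1) / d**0.5, dim=-1) @ v_art
    # ... and to the semantic stream (injecting real-world content), then blend.
    attn_sem = F.softmax(q @ k_sem.transpose(-2, -1) / d**0.5, dim=-1) @ v_sem
    return (1 - lam) * attn_art + lam * attn_sem

# Tiny usage example with random features.
B, N_a, N_s, D = 1, 16, 8, 32
art = torch.randn(B, N_a, D)
sem = torch.randn(B, N_s, D)
wq, wk, wv = (torch.randn(D, D) * D**-0.5 for _ in range(3))
out = cross_art_attention_sketch(art, sem, wq, wk, wv)
print(out.shape)  # torch.Size([1, 16, 32])
```

Blending two separately normalized attention outputs, rather than concatenating keys and values into one softmax, is one common way such adapters keep the injected condition from overwhelming the base stream; whether CreativeSynth does this is not stated in the abstract.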