{"title":"CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis","authors":"Yingli Hou;Wei Zhang;Zhiliang Zhu;Hai Yu","doi":"10.1109/TMM.2025.3535304","DOIUrl":null,"url":null,"abstract":"Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although significant progress in generating photo-realistic images through large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) The requirement for extensive training data and numerous model parameters; (2) Inefficient, multi-step image generation process; and (3) Difficulties in controlling the output visual features, requiring complexly designed prompts to ensure text-image alignment. Addressing these challenges, we introduce the CLIP-GAN model, which innovatively integrates the pretrained CLIP model into both the generator and discriminator of the GAN. Our architecture includes a CLIP-based generator that employs visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator, utilizing CLIP's advanced scene understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM) enabling better fusion of text and image semantic information. This integration within the generator and discriminator enhances training efficiency, enabling our model to achieve evaluation results not inferior to large pre-trained autoregressive and diffusion models, but with a 94% reduction in learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation given the limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation speed and the potential for greater stylistic diversity within the GAN model, while still preserving its smooth latent space.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3702-3715"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10855452/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although large pre-trained autoregressive and diffusion models have made significant progress in generating photo-realistic images, they face three critical constraints: (1) the requirement for extensive training data and a very large number of model parameters; (2) an inefficient, multi-step image generation process; and (3) difficulty in controlling the output visual features, which requires carefully designed prompts to ensure text-image alignment. To address these challenges, we introduce the CLIP-GAN model, which integrates the pretrained CLIP model into both the generator and the discriminator of the GAN. Our architecture includes a CLIP-based generator that incorporates visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator that utilizes CLIP's advanced scene-understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM), enabling better fusion of text and image semantic information. This integration within the generator and discriminator improves training efficiency, allowing our model to achieve evaluation results on par with large pre-trained autoregressive and diffusion models while using 94% fewer learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation under a limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation and the potential for greater stylistic diversity within the GAN framework, while still preserving its smooth latent space.
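The abstract gives no implementation details, but the core idea of conditioning both the generator and the discriminator on frozen CLIP text features can be illustrated with a minimal, hypothetical PyTorch sketch. The names and design choices below (`TGBlock`, `CLIP_DIM`, the FiLM-style modulation, the small convolutional discriminator head) are illustrative assumptions, not the paper's actual TG-Block, PFFM, or feature adapter, and the random `txt` tensor stands in for embeddings that would in practice come from a frozen CLIP text encoder.

```python
# Hypothetical sketch (not the authors' code): a GAN generator/discriminator pair
# conditioned on a frozen text embedding, e.g. from CLIP's text encoder.
# TGBlock here is an illustrative stand-in for text-image feature fusion;
# the paper's TG-Block and PFFM internals are not specified in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM = 512  # assumed CLIP ViT-B/32 text-embedding size


class TGBlock(nn.Module):
    """Text-conditioned block: modulates image features with the text embedding."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(CLIP_DIM, channels)  # text -> channel-wise scale
        self.to_shift = nn.Linear(CLIP_DIM, channels)  # text -> channel-wise shift
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, txt):
        scale = self.to_scale(txt)[:, :, None, None]
        shift = self.to_shift(txt)[:, :, None, None]
        return torch.relu(self.conv(feat * (1 + scale) + shift))


class Generator(nn.Module):
    def __init__(self, z_dim=128, channels=64, img_size=32):
        super().__init__()
        self.channels = channels
        self.init_size = img_size // 4
        self.fc = nn.Linear(z_dim + CLIP_DIM, channels * self.init_size ** 2)
        self.block1 = TGBlock(channels)
        self.block2 = TGBlock(channels)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, z, txt):
        x = self.fc(torch.cat([z, txt], dim=1))
        x = x.view(-1, self.channels, self.init_size, self.init_size)
        x = F.interpolate(self.block1(x, txt), scale_factor=2)
        x = F.interpolate(self.block2(x, txt), scale_factor=2)
        return torch.tanh(self.to_rgb(x))


class Discriminator(nn.Module):
    """Scores image/text pairs; a small conv stack replaces the frozen CLIP
    image encoder here purely to keep the sketch self-contained."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.score = nn.Linear(channels + CLIP_DIM, 1)

    def forward(self, img, txt):
        return self.score(torch.cat([self.conv(img), txt], dim=1))


if __name__ == "__main__":
    B = 4
    txt = torch.randn(B, CLIP_DIM)  # stand-in for frozen CLIP text features
    z = torch.randn(B, 128)
    G, D = Generator(), Discriminator()
    fake = G(z, txt)                       # (B, 3, 32, 32)
    print(fake.shape, D(fake, txt).shape)  # -> [4, 3, 32, 32], [4, 1]
```

In a setup like this, the CLIP encoders would stay frozen and only the GAN blocks would be trained, which is consistent with the abstract's claim of a much smaller learnable-parameter budget.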
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.