{"title":"CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis","authors":"Yingli Hou;Wei Zhang;Zhiliang Zhu;Hai Yu","doi":"10.1109/TMM.2025.3535304","DOIUrl":null,"url":null,"abstract":"Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although significant progress in generating photo-realistic images through large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) The requirement for extensive training data and numerous model parameters; (2) Inefficient, multi-step image generation process; and (3) Difficulties in controlling the output visual features, requiring complexly designed prompts to ensure text-image alignment. Addressing these challenges, we introduce the CLIP-GAN model, which innovatively integrates the pretrained CLIP model into both the generator and discriminator of the GAN. Our architecture includes a CLIP-based generator that employs visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator, utilizing CLIP's advanced scene understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM) enabling better fusion of text and image semantic information. This integration within the generator and discriminator enhances training efficiency, enabling our model to achieve evaluation results not inferior to large pre-trained autoregressive and diffusion models, but with a 94% reduction in learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation given the limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation speed and the potential for greater stylistic diversity within the GAN model, while still preserving its smooth latent space.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3702-3715"},"PeriodicalIF":9.7000,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10855452/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although large pre-trained autoregressive and diffusion models have made significant progress in generating photo-realistic images, they face three critical constraints: (1) the requirement for extensive training data and a very large number of model parameters; (2) an inefficient, multi-step image generation process; and (3) difficulty in controlling the output visual features, which requires carefully designed prompts to ensure text-image alignment. To address these challenges, we introduce the CLIP-GAN model, which integrates the pretrained CLIP model into both the generator and the discriminator of the GAN. Our architecture includes a CLIP-based generator that incorporates visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator that utilizes CLIP's advanced scene-understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM), enabling better fusion of text and image semantic information. This integration within the generator and discriminator improves training efficiency, allowing our model to achieve evaluation results on par with large pre-trained autoregressive and diffusion models while using 94% fewer learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation under a limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation and the potential for greater stylistic diversity within the GAN framework, while still preserving its smooth latent space.
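The abstract gives no implementation details, but the core idea of conditioning both the generator and the discriminator on frozen CLIP text features can be illustrated with a minimal, hypothetical PyTorch sketch. The names and design choices below (`TGBlock`, `CLIP_DIM`, the FiLM-style modulation, the small convolutional discriminator head) are illustrative assumptions, not the paper's actual TG-Block, PFFM, or feature adapter, and the random `txt` tensor stands in for embeddings that would in practice come from a frozen CLIP text encoder.

```python
# Hypothetical sketch (not the authors' code): a GAN generator/discriminator pair
# conditioned on a frozen text embedding, e.g. from CLIP's text encoder.
# TGBlock here is an illustrative stand-in for text-image feature fusion;
# the paper's TG-Block and PFFM internals are not specified in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM = 512  # assumed CLIP ViT-B/32 text-embedding size


class TGBlock(nn.Module):
    """Text-conditioned block: modulates image features with the text embedding."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Linear(CLIP_DIM, channels)  # text -> channel-wise scale
        self.to_shift = nn.Linear(CLIP_DIM, channels)  # text -> channel-wise shift
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, txt):
        scale = self.to_scale(txt)[:, :, None, None]
        shift = self.to_shift(txt)[:, :, None, None]
        return torch.relu(self.conv(feat * (1 + scale) + shift))


class Generator(nn.Module):
    def __init__(self, z_dim=128, channels=64, img_size=32):
        super().__init__()
        self.channels = channels
        self.init_size = img_size // 4
        self.fc = nn.Linear(z_dim + CLIP_DIM, channels * self.init_size ** 2)
        self.block1 = TGBlock(channels)
        self.block2 = TGBlock(channels)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, z, txt):
        x = self.fc(torch.cat([z, txt], dim=1))
        x = x.view(-1, self.channels, self.init_size, self.init_size)
        x = F.interpolate(self.block1(x, txt), scale_factor=2)
        x = F.interpolate(self.block2(x, txt), scale_factor=2)
        return torch.tanh(self.to_rgb(x))


class Discriminator(nn.Module):
    """Scores image/text pairs; a small conv stack replaces the frozen CLIP
    image encoder here purely to keep the sketch self-contained."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.score = nn.Linear(channels + CLIP_DIM, 1)

    def forward(self, img, txt):
        return self.score(torch.cat([self.conv(img), txt], dim=1))


if __name__ == "__main__":
    B = 4
    txt = torch.randn(B, CLIP_DIM)  # stand-in for frozen CLIP text features
    z = torch.randn(B, 128)
    G, D = Generator(), Discriminator()
    fake = G(z, txt)                       # (B, 3, 32, 32)
    print(fake.shape, D(fake, txt).shape)  # -> [4, 3, 32, 32], [4, 1]
```

In a setup like this, the CLIP encoders would stay frozen and only the GAN blocks would be trained, which is consistent with the abstract's claim of a much smaller learnable-parameter budget.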
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.