{"title":"基于提取的文本引导人脸生成语义转移","authors":"Guoxing Yang, Feifei Fu, Nanyi Fei, Hao Wu, Ruitao Ma, Zhiwu Lu","doi":"10.1109/ICME55011.2023.00149","DOIUrl":null,"url":null,"abstract":"Recently, large-scale pre-training has achieved great success in multi-modal tasks and shown powerful generalization ability due to superior semantic comprehension. In the field of text-to-image synthesis, recent works induce large-scale pre-training with VQ-VAE as a discrete visual tokenizer, which can synthesize realistic images from arbitrary text inputs. However, the quality of images generated by these methods is still inferior to that of images generated by GAN-based methods, especially in some specific domains. To leverage both the superior semantic comprehension of large-scale pre-training models and the powerful ability of GAN-based models in photorealistic image generation, we propose a novel knowledge distillation framework termed DiST-GAN to transfer the semantic knowledge of large-scale visual-language pre-training models (e.g., CLIP) to GAN-based generator for text-guided face image generation. Our DiST-GAN consists of two key components: (1) A new CLIP-based adaptive contrastive loss is devised to ensure the generated images are consistent with the input texts. (2) A language-to-vision (L2V) transformation module is learned to transform token embeddings of each text into an intermediate embedding that is aligned with the image embedding extracted by CLIP. With these two novel components, the semantic knowledge contained in CLIP can thus be transferred to GAN-based generator which preserves the superior ability of photorealistic image generation in the mean time. Extensive results on the Multi-Modal CelebA-HQ dataset show that our DiST-GAN achieves significant improvements over the state-of-the-arts.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DiST-GAN: Distillation-based Semantic Transfer for Text-Guided Face Generation\",\"authors\":\"Guoxing Yang, Feifei Fu, Nanyi Fei, Hao Wu, Ruitao Ma, Zhiwu Lu\",\"doi\":\"10.1109/ICME55011.2023.00149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, large-scale pre-training has achieved great success in multi-modal tasks and shown powerful generalization ability due to superior semantic comprehension. In the field of text-to-image synthesis, recent works induce large-scale pre-training with VQ-VAE as a discrete visual tokenizer, which can synthesize realistic images from arbitrary text inputs. However, the quality of images generated by these methods is still inferior to that of images generated by GAN-based methods, especially in some specific domains. To leverage both the superior semantic comprehension of large-scale pre-training models and the powerful ability of GAN-based models in photorealistic image generation, we propose a novel knowledge distillation framework termed DiST-GAN to transfer the semantic knowledge of large-scale visual-language pre-training models (e.g., CLIP) to GAN-based generator for text-guided face image generation. Our DiST-GAN consists of two key components: (1) A new CLIP-based adaptive contrastive loss is devised to ensure the generated images are consistent with the input texts. 
(2) A language-to-vision (L2V) transformation module is learned to transform token embeddings of each text into an intermediate embedding that is aligned with the image embedding extracted by CLIP. With these two novel components, the semantic knowledge contained in CLIP can thus be transferred to GAN-based generator which preserves the superior ability of photorealistic image generation in the mean time. Extensive results on the Multi-Modal CelebA-HQ dataset show that our DiST-GAN achieves significant improvements over the state-of-the-arts.\",\"PeriodicalId\":321830,\"journal\":{\"name\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE International Conference on Multimedia and Expo (ICME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICME55011.2023.00149\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00149","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DiST-GAN: Distillation-based Semantic Transfer for Text-Guided Face Generation
Recently, large-scale pre-training has achieved great success in multi-modal tasks and shown powerful generalization ability due to its superior semantic comprehension. In the field of text-to-image synthesis, recent works introduce large-scale pre-training with VQ-VAE as a discrete visual tokenizer, which can synthesize realistic images from arbitrary text inputs. However, the quality of images generated by these methods is still inferior to that of images generated by GAN-based methods, especially in some specific domains. To leverage both the superior semantic comprehension of large-scale pre-training models and the powerful photorealistic image generation ability of GAN-based models, we propose a novel knowledge distillation framework termed DiST-GAN, which transfers the semantic knowledge of large-scale vision-language pre-training models (e.g., CLIP) to a GAN-based generator for text-guided face image generation. Our DiST-GAN consists of two key components: (1) a new CLIP-based adaptive contrastive loss devised to ensure that the generated images are consistent with the input texts; (2) a language-to-vision (L2V) transformation module learned to transform the token embeddings of each text into an intermediate embedding aligned with the image embedding extracted by CLIP. With these two novel components, the semantic knowledge contained in CLIP can be transferred to the GAN-based generator while preserving its superior ability in photorealistic image generation. Extensive results on the Multi-Modal CelebA-HQ dataset show that our DiST-GAN achieves significant improvements over the state of the art.
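The abstract names the two components but gives no implementation details. The sketch below is a minimal, hypothetical PyTorch illustration of what such components might look like: a plain symmetric contrastive loss between CLIP image and text embeddings (standing in for the paper's adaptive variant, whose exact formulation is not given here) and a simple projection module mapping text token embeddings into CLIP's image embedding space. The names L2VTransform and clip_contrastive_loss, the mean-pooling of tokens, the MLP architecture, and the temperature value are all assumptions, not the authors' specification.

```python
# Minimal sketch (not the authors' code): a CLIP-style contrastive loss and an
# L2V-style module that maps text token embeddings toward CLIP's image space.
# Module/function names, pooling, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2VTransform(nn.Module):
    """Hypothetical language-to-vision module: token embeddings -> a single
    vector intended to align with CLIP's image embedding."""

    def __init__(self, token_dim: int = 512, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(token_dim, clip_dim),
            nn.ReLU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, num_tokens, token_dim); mean-pool, project, normalize.
        pooled = token_embeds.mean(dim=1)
        return F.normalize(self.proj(pooled), dim=-1)


def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between embeddings of generated images and
    their input texts (a generic stand-in for the paper's adaptive loss)."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a training loop of this kind, the generated face would be encoded by CLIP's image encoder to obtain image_embeds, the caption's token embeddings would pass through the L2V module (or CLIP's text encoder) to obtain text_embeds, and this loss would be added to the usual GAN objectives; how DiST-GAN weights and adapts these terms is described in the paper itself, not in this sketch.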