CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Face Manipulation
{"title":"CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Face Manipulation","authors":"Chenliang Zhou, Fangcheng Zhong, C. Öztireli","doi":"10.1145/3588432.3591532","DOIUrl":null,"url":null,"abstract":"Recently introduced Contrastive Language-Image Pre-Training (CLIP) [Radford et al. 2021] bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable face image manipulation with state-of-the-art quality and accuracy.","PeriodicalId":280036,"journal":{"name":"ACM SIGGRAPH 2023 Conference Proceedings","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM SIGGRAPH 2023 Conference Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3588432.3591532","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Recently introduced Contrastive Language-Image Pre-Training (CLIP) [Radford et al. 2021] bridges images and text by embedding them into a joint latent space. This opens the door to ample literature that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable face image manipulation with state-of-the-art quality and accuracy.
最近引入的对比语言图像预训练(CLIP) [Radford et al. 2021]通过将图像和文本嵌入到联合潜在空间中来连接图像和文本。这为大量旨在通过提供文本解释来操纵输入图像的文献打开了大门。然而,由于图像和文本嵌入在联合空间中的差异,使用文本嵌入作为优化目标往往会在得到的图像中引入不希望的伪影。对于操作来说,解缠性、可解释性和可控性也很难保证。为了缓解这些问题,我们建议定义由相关提示跨越的语料库子空间,以捕获特定的图像特征。我们引入CLIP投影增强嵌入(PAE)作为优化目标,以提高文本引导图像处理的性能。我们的方法是一个简单而通用的范例,可以很容易地计算和适应,并顺利地纳入任何基于clip的图像处理算法。为了证明我们的方法的有效性,我们进行了一些理论和实证研究。作为案例研究,我们利用文本引导的语义人脸编辑方法。我们在定量和定性上证明,PAE有助于以最先进的质量和精度进行更清晰、可解释和可控的面部图像处理。