FICE：利用引导式 GAN 反演进行文本条件时尚图像编辑

IF 7.5 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition Pub Date : 2024-09-14 DOI:10.1016/j.patcog.2024.111022

{"title":"FICE：利用引导式 GAN 反演进行文本条件时尚图像编辑","authors":"","doi":"10.1016/j.patcog.2024.111022","DOIUrl":null,"url":null,"abstract":"<div><p>Fashion-image editing is a challenging computer-vision task where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations or they are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that is capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the text-provided semantics, due to its impressive image–text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate the FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art, text-conditioned, image-editing approaches. Experimental results demonstrate that the FICE generates very realistic fashion images and leads to better editing than existing, competing approaches. The source code is publicly available from: <span><span>https://github.com/MartinPernus/FICE</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FICE: Text-conditioned fashion-image editing with guided GAN inversion\",\"authors\":\"\",\"doi\":\"10.1016/j.patcog.2024.111022\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Fashion-image editing is a challenging computer-vision task where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations or they are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that is capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the text-provided semantics, due to its impressive image–text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate the FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art, text-conditioned, image-editing approaches. Experimental results demonstrate that the FICE generates very realistic fashion images and leads to better editing than existing, competing approaches. The source code is publicly available from: <span><span>https://github.com/MartinPernus/FICE</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320324007738\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324007738","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

时尚图像编辑是一项具有挑战性的计算机视觉任务，其目标是将选定的服装融入给定的输入图像中。现有的大多数技术，即虚拟试穿方法，都是通过首先选择所需的服装示例图像，然后将服装转移到目标人物身上来完成这项任务的。相反，在本文中，我们考虑用文字描述来编辑时装图像。与基于示例的虚拟试穿技术相比，这种方法有几个优点：(i) 它不需要目标时装的图像，(ii) 它允许通过使用自然语言来表达各种视觉概念。现有的使用语言输入的图像编辑方法受到很大限制，因为它们要求训练集具有丰富的属性注释，或者只能处理简单的文本描述。针对这些限制，我们提出了一种名为 FICE（时尚图像 CLIP 编辑）的新颖文本条件编辑模型，该模型能够处理各种不同的文本描述，为编辑过程提供指导。具体来说，通过 FICE，我们扩展了常见的 GAN 转换过程，在生成图像时加入了语义、姿势相关和图像级约束。由于 CLIP 模型具有令人印象深刻的图像-文本关联能力，因此我们利用 CLIP 模型的功能来执行文本提供的语义。此外，我们还提出了一种潜在代码正则化技术，为更好地控制合成图像的保真度提供了手段。我们通过在 VITON 图像和 Fashion-Gen 文本描述组合上进行严格实验，并与几种最先进的、以文本为条件的图像编辑方法进行比较，对 FICE 进行了验证。实验结果表明，FICE 生成的时尚图像非常逼真，与现有的竞争方法相比，其编辑效果更好。源代码可从 https://github.com/MartinPernus/FICE 公开获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FICE: Text-conditioned fashion-image editing with guided GAN inversion

Fashion-image editing is a challenging computer-vision task where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations or they are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that is capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the text-provided semantics, due to its impressive image–text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate the FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art, text-conditioned, image-editing approaches. Experimental results demonstrate that the FICE generates very realistic fashion images and leads to better editing than existing, competing approaches. The source code is publicly available from: https://github.com/MartinPernus/FICE.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Pattern Recognition 工程技术-工程：电子与电气

CiteScore

14.40

自引率

16.20%

发文量

683

审稿时长

5.6 months

期刊介绍： The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.