Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei

arXiv - CS - Multimedia, arXiv:2409.08258, September 12, 2024
Diffusion models have revolutionized generative modeling across numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty stems from the requirement that the diffusion process not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers a garment-focused diffusion process with amplified guidance from both the basic visual appearance and the detailed textures (i.e., high-frequency details) of the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment.
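As a rough illustration of how such appearance priors can be fed to a latent diffusion UNet, the sketch below encodes the reference garment with a CLIP image encoder (semantic appearance) and a VAE encoder (detail-preserving latents) and projects both into cross-attention tokens. This is a simplified reading rather than GarDiff's released implementation; module names such as GarmentPriorEncoder and the projection layer are hypothetical.

```python
# Hedged sketch, not GarDiff's released code: turn a reference garment image into
# appearance-prior tokens for a latent diffusion UNet's cross-attention.
import torch
import torch.nn as nn
from transformers import CLIPVisionModelWithProjection
from diffusers import AutoencoderKL


class GarmentPriorEncoder(nn.Module):
    """Encode a garment with CLIP (semantics) and a VAE (detail latents)."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14",
                 vae_name="stabilityai/sd-vae-ft-mse", cross_dim=768):
        super().__init__()
        # cross_dim is assumed to match the CLIP projection dim (768 for ViT-L/14).
        self.clip = CLIPVisionModelWithProjection.from_pretrained(clip_name)
        self.vae = AutoencoderKL.from_pretrained(vae_name)
        # Project 4-channel VAE latents into the UNet's cross-attention width.
        self.vae_proj = nn.Linear(self.vae.config.latent_channels, cross_dim)

    def forward(self, garment_clip_pixels, garment_rgb):
        # garment_clip_pixels: (B, 3, 224, 224), already CLIP-preprocessed.
        # garment_rgb:         (B, 3, 512, 512), values scaled to [-1, 1].
        clip_token = self.clip(garment_clip_pixels).image_embeds.unsqueeze(1)  # (B, 1, 768)
        latents = self.vae.encode(garment_rgb).latent_dist.sample()            # (B, 4, 64, 64)
        vae_tokens = self.vae_proj(latents.flatten(2).transpose(1, 2))         # (B, 4096, 768)
        # Concatenated tokens serve as extra conditioning for the denoising UNet.
        return torch.cat([clip_token, vae_tokens], dim=1)
```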
Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose.
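To make the adapter idea concrete, here is a hedged sketch of a cross-attention adapter block that lets spatial UNet features attend to the garment-appearance and pose tokens; a zero-initialized gate keeps the pre-trained UNet behavior intact at the start of training. Class and argument names are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of a garment-focused adapter: UNet features cross-attend to the
# garment-appearance/pose tokens produced above. Names are illustrative only.
import torch
import torch.nn as nn


class GarmentFocusedAdapter(nn.Module):
    def __init__(self, feat_dim, token_dim=768, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=num_heads,
                                          kdim=token_dim, vdim=token_dim,
                                          batch_first=True)
        # Zero-initialized gate: the adapter starts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, unet_feats, cond_tokens):
        # unet_feats:  (B, C, H, W) features from one UNet block
        # cond_tokens: (B, N, token_dim) garment-appearance + pose tokens
        b, c, h, w = unet_feats.shape
        x = unet_feats.flatten(2).transpose(1, 2)                # (B, H*W, C)
        attended, _ = self.attn(self.norm(x), cond_tokens, cond_tokens)
        x = x + self.gate * attended                             # gated residual
        return x.transpose(1, 2).reshape(b, c, h, w)
```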
We specifically design an appearance loss over the synthesized garment to enhance its crucial high-frequency details.
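The paper defines the exact loss; as a hedged stand-in, the snippet below combines a masked L1 term with a penalty on high-frequency residuals extracted by a Laplacian high-pass filter over the garment region. The filter choice, mask usage, and weighting are assumptions made only for illustration.

```python
# Hedged sketch of an appearance loss emphasizing high-frequency garment detail.
# The actual GarDiff loss may differ; filter and weighting are assumptions.
import torch
import torch.nn.functional as F


def garment_appearance_loss(pred, target, garment_mask, hf_weight=1.0):
    # pred, target: (B, 3, H, W) synthesized and ground-truth images in [-1, 1]
    # garment_mask: (B, 1, H, W) binary mask of the garment region
    lap = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]], device=pred.device)
    lap = lap.view(1, 1, 3, 3).repeat(3, 1, 1, 1)               # depthwise Laplacian
    hf_pred = F.conv2d(pred * garment_mask, lap, padding=1, groups=3)
    hf_tgt = F.conv2d(target * garment_mask, lap, padding=1, groups=3)
    # Masked L1 term plus an extra penalty on high-frequency discrepancies.
    base = F.l1_loss(pred * garment_mask, target * garment_mask)
    return base + hf_weight * F.l1_loss(hf_pred, hf_tgt)
```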
Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff over state-of-the-art VTON approaches. Code is publicly available at https://github.com/siqi0905/GarDiff/tree/master.