Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
{"title":"Improving Virtual Try-On with Garment-focused Diffusion Models","authors":"Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei","doi":"arxiv-2409.08258","DOIUrl":null,"url":null,"abstract":"Diffusion models have led to the revolutionizing of generative modeling in\nnumerous image synthesis tasks. Nevertheless, it is not trivial to directly\napply diffusion models for synthesizing an image of a target person wearing a\ngiven in-shop garment, i.e., image-based virtual try-on (VTON) task. The\ndifficulty originates from the aspect that the diffusion process should not\nonly produce holistically high-fidelity photorealistic image of the target\nperson, but also locally preserve every appearance and texture detail of the\ngiven garment. To address this, we shape a new Diffusion model, namely GarDiff,\nwhich triggers the garment-focused diffusion process with amplified guidance of\nboth basic visual appearance and detailed textures (i.e., high-frequency\ndetails) derived from the given garment. GarDiff first remoulds a pre-trained\nlatent diffusion model with additional appearance priors derived from the CLIP\nand VAE encodings of the reference garment. Meanwhile, a novel garment-focused\nadapter is integrated into the UNet of diffusion model, pursuing local\nfine-grained alignment with the visual appearance of reference garment and\nhuman pose. We specifically design an appearance loss over the synthesized\ngarment to enhance the crucial, high-frequency details. Extensive experiments\non VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff\nwhen compared to state-of-the-art VTON approaches. Code is publicly available\nat:\n\\href{https://github.com/siqi0905/GarDiff/tree/master}{https://github.com/siqi0905/GarDiff/tree/master}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08258","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Diffusion models have revolutionized generative modeling across numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty lies in the fact that the diffusion process must not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new diffusion model, namely GarDiff, which triggers a garment-focused diffusion process with amplified guidance from both the basic visual appearance and the detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff over state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.
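
The abstract describes the appearance loss only at a high level. The snippet below is a minimal PyTorch sketch of one plausible way such a loss could be realized, assuming a Sobel-based high-pass filter and a binary garment mask; the function names (`high_frequency`, `appearance_loss`), the filter choice, and the equal weighting of the two terms are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of an appearance loss that emphasizes
# high-frequency garment details, assuming Sobel filtering and a binary mask.
import torch
import torch.nn.functional as F


def high_frequency(x: torch.Tensor) -> torch.Tensor:
    """Depthwise Sobel gradient magnitude as a stand-in high-pass filter (assumption)."""
    sobel_x = torch.tensor([[-1.0, 0.0, 1.0],
                            [-2.0, 0.0, 2.0],
                            [-1.0, 0.0, 1.0]], device=x.device, dtype=x.dtype).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, sobel_x.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(x, sobel_y.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def appearance_loss(pred: torch.Tensor, target: torch.Tensor,
                    garment_mask: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the masked garment pixels plus their high-frequency responses."""
    pixel_term = F.l1_loss(pred * garment_mask, target * garment_mask)
    hf_term = F.l1_loss(high_frequency(pred) * garment_mask,
                        high_frequency(target) * garment_mask)
    return pixel_term + hf_term
```

In this sketch the mask restricts supervision to the synthesized garment region, and the second term penalizes discrepancies in edge-like (high-frequency) structure, which is the stated goal of the loss; the relative weighting and the exact frequency decomposition used in GarDiff may differ.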