Improving Virtual Try-On with Garment-focused Diffusion Models
Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
arXiv - CS - Multimedia, published 2024-09-12, DOI: arxiv-2409.08258
Citations: 0
Abstract
Diffusion models have revolutionized generative modeling in
numerous image synthesis tasks. Nevertheless, it is not trivial to directly
apply diffusion models to synthesize an image of a target person wearing a
given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The
difficulty arises because the diffusion process should not
only produce a holistically high-fidelity, photorealistic image of the target
person, but also locally preserve every appearance and texture detail of the
given garment. To address this, we propose a new diffusion model, namely GarDiff,
which drives a garment-focused diffusion process with amplified guidance from
both basic visual appearance and detailed textures (i.e., high-frequency
details) derived from the given garment. GarDiff first remoulds a pre-trained
latent diffusion model with additional appearance priors derived from the CLIP
and VAE encodings of the reference garment. Meanwhile, a novel garment-focused
adapter is integrated into the UNet of the diffusion model, pursuing local
fine-grained alignment with the visual appearance of the reference garment and
human pose. We specifically design an appearance loss over the synthesized
garment to enhance the crucial, high-frequency details. Extensive experiments
on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff
when compared to state-of-the-art VTON approaches. Code is publicly available
at: https://github.com/siqi0905/GarDiff/tree/master.
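
The abstract does not give the form of the appearance loss, so the following is only an illustrative sketch of one way a loss could emphasize high-frequency detail inside the garment region. It assumes a 3x3 Laplacian as the high-pass filter, grayscale images, and a binary garment mask; the names high_pass, high_frequency_loss, and garment_mask are hypothetical and are not taken from the GarDiff code.

```python
import numpy as np

# 3x3 Laplacian kernel: a simple high-pass filter (an assumption of this sketch,
# not the filter used by the paper).
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])


def high_pass(image: np.ndarray) -> np.ndarray:
    """Apply the Laplacian to a 2-D grayscale image with edge-replicated padding."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Accumulate the nine shifted, weighted copies of the padded image.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def high_frequency_loss(synth: np.ndarray,
                        target: np.ndarray,
                        garment_mask: np.ndarray) -> float:
    """Mean absolute difference of high-pass responses inside the garment mask."""
    diff = np.abs(high_pass(synth) - high_pass(target)) * garment_mask
    area = garment_mask.sum()
    return float(diff.sum() / area) if area > 0 else 0.0


# Usage: identical images have identical high-frequency content, so the loss is zero.
rng = np.random.default_rng(0)
target = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
print(high_frequency_loss(target, target, mask))  # 0.0
```

Because the Laplacian kernel sums to zero, a uniform brightness shift leaves this loss unchanged, while a mismatch in fine texture inside the mask raises it. That is the general behaviour the abstract attributes to a loss targeting high-frequency details; the actual formulation in GarDiff may differ.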