GLV: Geometric Correlation Distillation for Latent Diffusion-Enhanced Parser-Free Virtual Try-On
Authors: Chenghu Du; Junyin Wang; Kai Liu; Shengwu Xiong
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9175-9189 (Q1, Engineering, Electrical & Electronic)
DOI: 10.1109/TCSVT.2025.3556749
Publication date: 2025-04-01
URL: https://ieeexplore.ieee.org/document/10947108/
Citations: 0
Abstract
Applying knowledge distillation to virtual try-on tasks is challenging because current methods fail to fully and efficiently exploit the teacher's knowledge. In other words, existing approaches merely transfer prior knowledge to the student model via pseudo-labels generated by the teacher model, resulting in shallow knowledge representation and low training efficiency. To address these limitations, we propose a novel teacher-student architecture for parser-free virtual try-on, named GLV, which generates high-quality try-on results with realistic body details. Specifically, we propose a deformation-related prior distillation method to effectively leverage the valuable deformation information contained in the teacher warpage model. This enhances the convergence efficiency of the student warpage model, preventing it from getting stuck in a local minimum. Moreover, we are the first to propose geometric correlation distillation, which models the underlying geometric relationship between clothing and the person and transfers this relationship from the teacher to the student. This enables the student warpage model to reduce the entanglement of deformation-irrelevant features, such as color and texture. Finally, we propose a clothing-body retouching method for try-on result synthesis, which refines the denoising process in the latent space of a well-trained diffusion model, thereby preventing catastrophic forgetting. This method seamlessly transforms the parser-based inpainting synthesis paradigm into a parser-free synthesis paradigm and enables efficient convergence of the diffusion model with only fine-tuning. Extensive experiments demonstrate the generality of our approach and highlight its superiority over previous methods.
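The geometric correlation distillation idea described above can be illustrated with a minimal sketch: build a pairwise correlation matrix between clothing and person feature maps (which captures spatial geometric relationships while normalization discards appearance magnitude such as color and texture), compute it for both teacher and student, and penalize their difference. All names, shapes, and the choice of cosine correlation plus an L1 penalty are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def correlation_map(cloth_feat, person_feat):
    """Pairwise cosine correlation between clothing and person features.

    cloth_feat, person_feat: (C, H, W) feature maps from a warpage
    network (hypothetical shapes). Each spatial location's channel
    vector is unit-normalized, so the resulting (H*W, H*W) matrix
    encodes geometric relationships rather than raw feature magnitude.
    """
    c, h, w = cloth_feat.shape
    cf = cloth_feat.reshape(c, h * w)
    pf = person_feat.reshape(c, h * w)
    # Normalize channel vectors at every spatial position.
    cf = cf / (np.linalg.norm(cf, axis=0, keepdims=True) + 1e-8)
    pf = pf / (np.linalg.norm(pf, axis=0, keepdims=True) + 1e-8)
    return cf.T @ pf  # (H*W, H*W) clothing-to-person correlations

def geometric_correlation_loss(t_cloth, t_person, s_cloth, s_person):
    """Match the student's clothing-person correlation to the teacher's.

    In training, the teacher features would be detached/frozen; here
    both sides are plain arrays for clarity.
    """
    t_corr = correlation_map(t_cloth, t_person)
    s_corr = correlation_map(s_cloth, s_person)
    return np.abs(s_corr - t_corr).mean()  # L1 distillation penalty
```

Note that because the correlation matrix is invariant to per-location feature scaling, deformation-irrelevant attributes contribute less to this loss than to a direct feature-matching loss, which is the intuition behind distilling the correlation rather than the features themselves.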
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.