SinWaveFusion: Learning a single image diffusion model in wavelet domain
Jisoo Kim, Jiwoo Kang, Taewan Kim, Heeseok Oh
Image and Vision Computing, Volume 159, Article 105551
DOI: 10.1016/j.imavis.2025.105551
Published: 2025-04-30 (Journal Article; Impact Factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0262885625001398
Citations: 0
Abstract
Although recent advancements in large-scale image generation models have substantially improved visual fidelity and reliability, current diffusion models continue to encounter significant challenges in maintaining stylistic consistency with the original images. These challenges stem primarily from the intrinsic stochastic nature of the diffusion process, leading to noticeable variability and inconsistency in edited outputs. To address these challenges, this paper proposes a novel framework termed single image wavelet diffusion (SinWaveFusion), explicitly designed to enhance the consistency and fidelity in generating images derived from a single source image while also mitigating information leakage. SinWaveFusion addresses generative artifacts by employing the multi-scale properties inherent in wavelet decomposition, which incorporates a built-in up-down scaling mechanism. This approach enables refined image manipulation while enhancing stylistic coherence. The proposed diffusion model, trained exclusively on a single source image, utilizes the hierarchical structure of wavelet subbands to effectively capture spatial and spectral information in the sampling process, minimizing reconstruction loss and ensuring high-quality, diverse outputs. Moreover, the architecture of the denoiser features a reduced receptive field, strategically preventing the model from memorizing the entire training image and thereby offering additional computational efficiency benefits. Experimental results demonstrate that SinWaveFusion achieves improved performance in both conditional and unconditional generation compared to existing generative models trained on a single image.
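The abstract's core mechanism is wavelet decomposition: splitting an image into hierarchical subbands that separate coarse structure from fine detail, with a built-in halving of spatial resolution at each level. As a minimal illustration of that idea (not the authors' implementation, whose denoiser architecture and training procedure are described only in the full paper), the sketch below performs one level of an orthonormal 2-D Haar transform in plain NumPy, producing the four standard subbands at half resolution and reconstructing the image exactly:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the orthonormal 2-D Haar wavelet transform.

    Splits a (H, W) image into four (H/2, W/2) subbands:
    LL (coarse approximation), LH and HL (horizontal/vertical
    detail), and HH (diagonal detail).
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: perfectly reconstructs the original image."""
    h, w = ll.shape
    img = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    img[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    img[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    img[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    img[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return img
```

Because the transform is orthonormal, it preserves signal energy and loses no information, which is what makes a built-in, invertible up/down-scaling mechanism possible; applying `haar_dwt2` recursively to the LL subband yields the multi-scale hierarchy the abstract refers to.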
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.