SinWaveFusion: Learning a single image diffusion model in wavelet domain
Jisoo Kim, Jiwoo Kang, Taewan Kim, Heeseok Oh
Image and Vision Computing, Volume 159, Article 105551
DOI: 10.1016/j.imavis.2025.105551
Published: 2025-04-30 (Journal Article; Impact Factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0262885625001398
Citations: 0
Abstract
Although recent advancements in large-scale image generation models have substantially improved visual fidelity and reliability, current diffusion models continue to encounter significant challenges in maintaining stylistic consistency with the original images. These challenges stem primarily from the intrinsic stochastic nature of the diffusion process, leading to noticeable variability and inconsistency in edited outputs. To address these challenges, this paper proposes a novel framework termed single image wavelet diffusion (SinWaveFusion), explicitly designed to enhance the consistency and fidelity in generating images derived from a single source image while also mitigating information leakage. SinWaveFusion addresses generative artifacts by employing the multi-scale properties inherent in wavelet decomposition, which incorporates a built-in up-down scaling mechanism. This approach enables refined image manipulation while enhancing stylistic coherence. The proposed diffusion model, trained exclusively on a single source image, utilizes the hierarchical structure of wavelet subbands to effectively capture spatial and spectral information in the sampling process, minimizing reconstruction loss and ensuring high-quality, diverse outputs. Moreover, the architecture of the denoiser features a reduced receptive field, strategically preventing the model from memorizing the entire training image and thereby offering additional computational efficiency benefits. Experimental results demonstrate that SinWaveFusion achieves improved performance in both conditional and unconditional generation compared to existing generative models trained on a single image.
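The abstract's core mechanism is wavelet decomposition: splitting an image into hierarchical subbands that separate coarse structure from fine detail, with a built-in halving of spatial resolution at each level. As a minimal illustration of that idea (not the authors' implementation, whose denoiser architecture and training procedure are described only in the full paper), the sketch below performs one level of an orthonormal 2-D Haar transform in plain NumPy, producing the four standard subbands at half resolution and reconstructing the image exactly:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the orthonormal 2-D Haar wavelet transform.

    Splits a (H, W) image into four (H/2, W/2) subbands:
    LL (coarse approximation), LH and HL (horizontal/vertical
    detail), and HH (diagonal detail).
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: perfectly reconstructs the original image."""
    h, w = ll.shape
    img = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    img[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    img[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    img[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    img[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return img
```

Because the transform is orthonormal, it preserves signal energy and loses no information, which is what makes a built-in, invertible up/down-scaling mechanism possible; applying `haar_dwt2` recursively to the LL subband yields the multi-scale hierarchy the abstract refers to.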
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.