Sketch-guided scene image generation with diffusion model

IF 2.5 · Zone 4 (Computer Science) · Q2 (Computer Science, Software Engineering)
Tianyu Zhang, Xiaoxuan Xie, Xusheng Du, Haoran Xie
{"title":"Sketch-guided scene image generation with diffusion model","authors":"Tianyu Zhang,&nbsp;Xiaoxuan Xie,&nbsp;Xusheng Du,&nbsp;Haoran Xie","doi":"10.1016/j.cag.2025.104226","DOIUrl":null,"url":null,"abstract":"<div><div>Text-to-image models showcase the impressive ability to generate high-quality and diverse images. However, the transition from freehand sketches to complex scene images with multiple objects remains challenging in computer graphics. In this study, we propose a novel sketch-guided scene image generation framework, decomposing the task of scene image generation from sketch inputs into object-level cross-domain generation and scene-level image construction steps. We first employ a pre-trained diffusion model to convert each single object drawing into a separate image, which can infer additional image details while maintaining the sparse sketch structure. To preserve the conceptual fidelity of the foreground during scene generation, we invert the visual features of object images into identity embeddings for scene generation. For scene-level image construction, we generate the latent representation of the scene image using the separated background prompts. Then, we blend the generated foreground objects with the background image guided by the layout of sketch inputs. We infer the scene image on the blended latent representation using a global prompt with the trained identity tokens to ensure the foreground objects’ details remain unchanged while naturally composing the scene image. Through qualitative and quantitative experiments, we demonstrated that the proposed method’s ability surpasses the state-of-the-art approaches for scene image generation from hand-drawn sketches.</div></div>","PeriodicalId":50628,"journal":{"name":"Computers & Graphics-Uk","volume":"129 ","pages":"Article 104226"},"PeriodicalIF":2.5000,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Graphics-Uk","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0097849325000676","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

Text-to-image models showcase the impressive ability to generate high-quality and diverse images. However, the transition from freehand sketches to complex scene images with multiple objects remains challenging in computer graphics. In this study, we propose a novel sketch-guided scene image generation framework, decomposing the task of scene image generation from sketch inputs into object-level cross-domain generation and scene-level image construction steps. We first employ a pre-trained diffusion model to convert each single object drawing into a separate image, which can infer additional image details while maintaining the sparse sketch structure. To preserve the conceptual fidelity of the foreground during scene generation, we invert the visual features of object images into identity embeddings for scene generation. For scene-level image construction, we generate the latent representation of the scene image using the separated background prompts. Then, we blend the generated foreground objects with the background image guided by the layout of sketch inputs. We infer the scene image on the blended latent representation using a global prompt with the trained identity tokens to ensure the foreground objects’ details remain unchanged while naturally composing the scene image. Through qualitative and quantitative experiments, we demonstrate that the proposed method surpasses state-of-the-art approaches for scene image generation from hand-drawn sketches.
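To make the scene-level construction step more concrete, the following is a minimal, hypothetical PyTorch sketch of the latent blending described in the abstract: foreground object latents are pasted into the background latent at the regions indicated by the sketch layout, before the blended latent is denoised with the global prompt containing the trained identity tokens. The function name, tensor shapes, and bounding-box format are illustrative assumptions, not the authors' actual implementation.

# Hypothetical sketch of scene-level latent blending, assuming latents in
# standard (1, C, H, W) layout and sketch-derived bounding boxes in latent
# coordinates. Not the authors' implementation.
import torch
import torch.nn.functional as F

def blend_scene_latents(background_latent, object_latents, layout_boxes):
    """Paste each foreground object latent into the background latent at the
    region given by its sketch bounding box (x0, y0, x1, y1).

    background_latent: tensor of shape (1, C, H, W)
    object_latents:    list of tensors of shape (1, C, h_i, w_i)
    layout_boxes:      list of (x0, y0, x1, y1) tuples from the sketch layout
    """
    scene_latent = background_latent.clone()
    for obj_latent, (x0, y0, x1, y1) in zip(object_latents, layout_boxes):
        # Resize the object latent to fit its layout region, then overwrite
        # the corresponding region of the background latent.
        region = F.interpolate(obj_latent, size=(y1 - y0, x1 - x0),
                               mode="bilinear", align_corners=False)
        scene_latent[:, :, y0:y1, x0:x1] = region
    return scene_latent

In this reading of the pipeline, the blended latent would then be denoised by the diffusion model conditioned on a global prompt that includes the learned identity tokens, so that the foreground objects keep their inverted appearance while the final scene is composed naturally.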
Source journal
Computers & Graphics-Uk (Engineering & Technology / Computer Science: Software Engineering)
CiteScore: 5.30
Self-citation rate: 12.00%
Annual articles: 173
Review time: 38 days
Journal description: Computers & Graphics is dedicated to disseminating information on research and applications of computer graphics (CG) techniques. The journal encourages articles on:
1. Research and applications of interactive computer graphics. We are particularly interested in novel interaction techniques and applications of CG to problem domains.
2. State-of-the-art papers on late-breaking, cutting-edge research on CG.
3. Information on innovative uses of graphics principles and technologies.
4. Tutorial papers on both teaching CG principles and innovative uses of CG in education.