Sketch-Guided Text-to-Image Diffusion Models

A. Voynov, Kfir Aberman, D. Cohen-Or
{"title":"Sketch-Guided Text-to-Image Diffusion Models","authors":"A. Voynov, Kfir Aberman, D. Cohen-Or","doi":"10.1145/3588432.3591560","DOIUrl":null,"url":null,"abstract":"Text-to-Image models have introduced a remarkable leap in the evolution of machine learning, demonstrating high-quality synthesis of images from a given text-prompt. However, these powerful pretrained models still lack control handles that can guide spatial properties of the synthesized images. In this work, we introduce a universal approach to guide a pretrained text-to-image diffusion model, with a spatial map from another domain (e.g., sketch) during inference time. Unlike previous works, our method does not require to train a dedicated model or a specialized encoder for the task. Our key idea is to train a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron (MLP) that maps latent features of noisy images to spatial maps, where the deep features are extracted from the core Denoising Diffusion Probabilistic Model (DDPM) network. The LGP is trained only on a few thousand images and constitutes a differential guiding map predictor, over which the loss is computed and propagated back to push the intermediate images to agree with the spatial map. The per-pixel training offers flexibility and locality which allows the technique to perform well on out-of-domain sketches, including free-hand style drawings. We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain.","PeriodicalId":280036,"journal":{"name":"ACM SIGGRAPH 2023 Conference Proceedings","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"67","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM SIGGRAPH 2023 Conference Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3588432.3591560","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 67

Abstract

Text-to-image models have introduced a remarkable leap in the evolution of machine learning, demonstrating high-quality synthesis of images from a given text prompt. However, these powerful pretrained models still lack control handles that can guide the spatial properties of the synthesized images. In this work, we introduce a universal approach to guide a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., sketch) during inference time. Unlike previous works, our method does not require training a dedicated model or a specialized encoder for the task. Our key idea is to train a Latent Guidance Predictor (LGP), a small, per-pixel Multi-Layer Perceptron (MLP) that maps latent features of noisy images to spatial maps, where the deep features are extracted from the core Denoising Diffusion Probabilistic Model (DDPM) network. The LGP is trained on only a few thousand images and constitutes a differentiable guiding-map predictor, over which the loss is computed and propagated back to push the intermediate images to agree with the spatial map. The per-pixel training offers flexibility and locality, which allows the technique to perform well on out-of-domain sketches, including free-hand style drawings. We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain.
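The abstract describes the mechanism at a high level; the sketch below illustrates one way the pieces could fit together in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the feature dimension, the MLP depth, the step size, and the feature_fn callable (standing in for extracting and upsampling intermediate DDPM activations into per-pixel feature vectors) are all hypothetical.

```python
# Hedged sketch of the LGP idea: a small per-pixel MLP maps deep features of a
# noisy image to a spatial (edge) map, and at inference time the loss against a
# target sketch is backpropagated to nudge the noisy latent. All names and
# hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentGuidancePredictor(nn.Module):
    """Per-pixel MLP: one feature vector in, one guidance-map value out."""
    def __init__(self, feature_dim: int = 960, hidden_dim: int = 128):
        # feature_dim is a hypothetical size for concatenated DDPM features.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one predicted map value per pixel
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, feature_dim); Linear is applied per pixel.
        return self.mlp(feats)

def guidance_step(latent, feature_fn, lgp, target_sketch, step_size=1.0):
    """One inference-time correction of the noisy latent.

    feature_fn is an assumed differentiable callable that returns per-pixel
    deep features of `latent`, e.g. gathered from intermediate layers of the
    pretrained denoiser and upsampled to a common resolution.
    """
    latent = latent.detach().requires_grad_(True)
    feats = feature_fn(latent)              # (B, H*W, feature_dim)
    pred = lgp(feats).squeeze(-1)           # (B, H*W) predicted spatial map
    loss = F.mse_loss(pred, target_sketch)  # agreement with the target sketch
    grad, = torch.autograd.grad(loss, latent)
    # Step the intermediate latent toward agreement with the spatial map.
    return (latent - step_size * grad).detach()
```

Training the LGP itself would then be plain supervised regression, consistent with the abstract: noise a few thousand images with the forward diffusion process, extract their per-pixel deep features from the denoiser, and fit the MLP to the corresponding edge maps with a per-pixel loss.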