Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
{"title":"Improving Text-guided Object Inpainting with Semantic Pre-inpainting","authors":"Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei","doi":"arxiv-2409.08260","DOIUrl":null,"url":null,"abstract":"Recent years have witnessed the success of large text-to-image diffusion\nmodels and their remarkable potential to generate high-quality images. The\nfurther pursuit of enhancing the editability of images has sparked significant\ninterest in the downstream task of inpainting a novel object described by a\ntext prompt within a designated region in the image. Nevertheless, the problem\nis not trivial from two aspects: 1) Solely relying on one single U-Net to align\ntext prompt and visual object across all the denoising timesteps is\ninsufficient to generate desired objects; 2) The controllability of object\ngeneration is not guaranteed in the intricate sampling space of diffusion\nmodel. In this paper, we propose to decompose the typical single-stage object\ninpainting into two cascaded processes: 1) semantic pre-inpainting that infers\nthe semantic features of desired objects in a multi-modal feature space; 2)\nhigh-fieldity object generation in diffusion latent space that pivots on such\ninpainted semantic features. To achieve this, we cascade a Transformer-based\nsemantic inpainter and an object inpainting diffusion model, leading to a novel\nCAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object\ninpainting. Technically, the semantic inpainter is trained to predict the\nsemantic features of the target object conditioning on unmasked context and\ntext prompt. The outputs of the semantic inpainter then act as the informative\nvisual prompts to guide high-fieldity object generation through a reference\nadapter layer, leading to controllable object inpainting. Extensive evaluations\non OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against\nthe state-of-the-art methods. Code is available at\n\\url{https://github.com/Nnn-s/CATdiffusion}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08260","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object, described by a text prompt, within a designated region of the image. Nevertheless, the problem is not trivial for two reasons: 1) relying solely on a single U-Net to align the text prompt and the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting, which infers the semantic features of desired objects in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space, which pivots on these inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then act as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against state-of-the-art methods. Code is available at \url{https://github.com/Nnn-s/CATdiffusion}.
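To make the two-stage cascade concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a Transformer-based semantic inpainter that fills in object-level semantic features from the unmasked context and the text prompt, and a reference-adapter cross-attention layer that injects those features into a diffusion U-Net block. All module names, layer sizes, and the use of CLIP-style patch/text features are assumptions for illustration only, not the authors' actual implementation (see the released code for that).

```python
import torch
import torch.nn as nn


class SemanticInpainter(nn.Module):
    """Stage 1 (sketch): predict semantic features of the masked object
    from unmasked image context and the text prompt."""

    def __init__(self, dim=768, depth=6, heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # placeholder for masked patches
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, context_feats, mask, text_feats):
        # context_feats: (B, N, D) patch features; mask: (B, N) bool, True = masked region
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(context_feats),
                        context_feats)
        x = torch.cat([text_feats, x], dim=1)       # prepend text tokens as the condition
        x = self.encoder(x)
        return x[:, text_feats.size(1):]            # inpainted semantic features, (B, N, D)


class ReferenceAdapter(nn.Module):
    """Stage 2 hook (sketch): cross-attention that lets a diffusion U-Net block
    attend to the pre-inpainted semantic features as an extra visual prompt."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, unet_hidden, semantic_feats):
        # unet_hidden: (B, M, D) tokens inside a U-Net attention block
        out, _ = self.attn(self.norm(unet_hidden), semantic_feats, semantic_feats)
        return unet_hidden + out                    # residual injection of the visual prompt


if __name__ == "__main__":
    B, N, T, D = 2, 196, 77, 768                    # hypothetical batch/patch/token/dim sizes
    inpainter, adapter = SemanticInpainter(D), ReferenceAdapter(D)
    ctx = torch.randn(B, N, D)                      # e.g. patch features of the masked image
    msk = torch.zeros(B, N, dtype=torch.bool)
    msk[:, :40] = True                              # designated inpainting region
    txt = torch.randn(B, T, D)                      # e.g. text-encoder features of the prompt
    sem = inpainter(ctx, msk, txt)                  # stage 1: semantic pre-inpainting
    hid = adapter(torch.randn(B, 64, D), sem)       # stage 2: guide a U-Net block
    print(sem.shape, hid.shape)
```

The point of the decomposition is that alignment between text and object is resolved once, in a multi-modal feature space, rather than being re-learned by the U-Net at every denoising timestep; the adapter then only has to keep the diffusion sampling anchored to those already-aligned features.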