Scene Graph Driven Text-Prompt Generation for Image Inpainting

Tripti Shukla, Paridhi Maheshwari, Rajhans Singh, Ankit Shukla, K. Kulkarni, P. Turaga
{"title":"Scene Graph Driven Text-Prompt Generation for Image Inpainting","authors":"Tripti Shukla, Paridhi Maheshwari, Rajhans Singh, Ankit Shukla, K. Kulkarni, P. Turaga","doi":"10.1109/CVPRW59228.2023.00083","DOIUrl":null,"url":null,"abstract":"Scene editing methods are undergoing a revolution, driven by text-to-image synthesis methods. Applications in media content generation have benefited from a careful set of engineered text prompts, that have been arrived at by the artists by trial and error. There is a growing need to better model prompt generation, for it to be useful for a broad range of consumer-grade applications. We propose a novel method for text prompt generation for the explicit purpose of consumer-grade image inpainting, i.e. insertion of new objects into missing regions in an image. Our approach leverages existing inter-object relationships to generate plausible textual descriptions for the missing object, that can then be used with any text-to-image generator. Given an image and a location where a new object is to be inserted, our approach first converts the given image to an intermediate scene graph. Then, we use graph convolutional networks to ‘expand’ the scene graph by predicting the identity and relationships of the new object to be inserted, with respect to the existing objects in the scene. The output of the expanded scene graph is cast into a textual description, which is then processed by a text-to-image generator, conditioned on the given image, to produce the final inpainted image. We conduct extensive experiments on the Visual Genome dataset, and show through qualitative and quantitative metrics that our method is superior to other methods.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"2014 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPRW59228.2023.00083","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Scene editing methods are undergoing a revolution driven by text-to-image synthesis. Applications in media content generation have benefited from carefully engineered text prompts that artists arrive at through trial and error. There is a growing need to model prompt generation more systematically so that it becomes useful for a broad range of consumer-grade applications. We propose a novel method for text-prompt generation aimed explicitly at consumer-grade image inpainting, i.e., the insertion of new objects into missing regions of an image. Our approach leverages existing inter-object relationships to generate plausible textual descriptions for the missing object, which can then be used with any text-to-image generator. Given an image and a location where a new object is to be inserted, our approach first converts the given image to an intermediate scene graph. Then, we use graph convolutional networks to ‘expand’ the scene graph by predicting the identity of the new object and its relationships to the existing objects in the scene. The expanded scene graph is cast into a textual description, which is then processed by a text-to-image generator, conditioned on the given image, to produce the final inpainted image. We conduct extensive experiments on the Visual Genome dataset and show through qualitative and quantitative metrics that our method outperforms existing approaches.
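To make the pipeline concrete, below is a minimal sketch of the two steps the abstract describes: a graph-convolutional "expander" that predicts the identity of a new node and its relation to existing nodes, followed by casting the predicted triple into a text prompt. This is not the authors' released code; the `SceneGraphExpander` class, the toy object/predicate vocabularies, the single-layer mean-aggregation GCN, and the prompt template are all illustrative assumptions, since the abstract only states that graph convolutional networks perform the expansion.

```python
# Hypothetical sketch of scene-graph expansion + prompt casting.
# Model architecture, vocabularies, and prompt template are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBJECTS = ["person", "dog", "bench", "tree", "frisbee"]   # toy vocabulary
PREDICATES = ["next to", "on", "under", "holding"]        # toy vocabulary


class SceneGraphExpander(nn.Module):
    """One graph-convolution layer that scores (a) the category of a new
    node inserted at the masked location and (b) its predicate with
    respect to each existing node."""

    def __init__(self, n_obj, n_pred, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_obj + 1, dim)  # +1 for the unknown node
        self.gcn = nn.Linear(dim, dim)             # shared message weights
        self.obj_head = nn.Linear(dim, n_obj)
        self.pred_head = nn.Linear(2 * dim, n_pred)

    def forward(self, node_ids, adj):
        x = self.embed(node_ids)                        # (N, dim)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        x = F.relu(self.gcn(adj @ x / deg))             # mean-aggregate, then transform
        new = x[-1]                                     # inserted node is last
        obj_logits = self.obj_head(new)                 # identity of new object
        pairs = torch.cat([new.expand_as(x[:-1]), x[:-1]], dim=-1)
        pred_logits = self.pred_head(pairs)             # relation to each old node
        return obj_logits, pred_logits


def expanded_graph_to_prompt(new_obj, predicate, anchor):
    # Cast the predicted triple into a plain-language prompt for any
    # text-to-image inpainting model (template is an assumption).
    return f"a {new_obj} {predicate} the {anchor}"


# Untrained toy usage: existing scene {person, bench} plus one unknown node.
model = SceneGraphExpander(len(OBJECTS), len(PREDICATES))
node_ids = torch.tensor([0, 2, len(OBJECTS)])           # person, bench, <new>
adj = torch.tensor([[0, 1, 1], [1, 0, 1], [1, 1, 0.]])  # fully connect <new>
obj_logits, pred_logits = model(node_ids, adj)
new_obj = OBJECTS[obj_logits.argmax()]
anchor_idx = pred_logits.max(-1).values.argmax()        # most confident pairing
predicate = PREDICATES[pred_logits[anchor_idx].argmax()]
print(expanded_graph_to_prompt(new_obj, predicate, ["person", "bench"][anchor_idx]))
```

In the full system described by the abstract, the expander would be trained on Visual Genome triples rather than using the random weights above, and the resulting prompt would condition a text-to-image generator on the masked input image to produce the inpainted result.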