Transformer-based Image Generation from Scene Graphs

Comput. Vis. Image Underst. Pub Date: 2023-03-08. DOI: 10.48550/arXiv.2303.04634

Renato Sortino, S. Palazzo, C. Spampinato

Journal: Comput. Vis. Image Underst. Volume: 37 1. Pages: 103721. Publication type: Journal Article. Open access: no. Platform: Semanticscholar.

Citations: 3

Abstract

Graph-structured scene descriptions can be used efficiently in generative models to control the composition of the generated image. Previous approaches combine graph convolutional networks for layout prediction with adversarial methods for image generation. In this work, we show that encoding the graph information with multi-head attention and generating images with a transformer-based model in the latent space improves the quality of the sampled data, without the need for adversarial models, which in turn benefits training stability. Specifically, the proposed approach is based entirely on transformer architectures, both for encoding scene graphs into intermediate object layouts and for decoding these layouts into images, passing through a lower-dimensional space learned by a vector-quantized variational autoencoder. Our approach shows improved image quality with respect to state-of-the-art methods, as well as a higher degree of diversity among multiple generations from the same scene graph. We evaluate our approach on three public datasets: Visual Genome, COCO, and CLEVR. We achieve an Inception Score of 13.7 and 12.8, and an FID of 52.3 and 60.3, on COCO and Visual Genome, respectively. We perform ablation studies on our contributions to assess the impact of each component. Code is available at https://github.com/perceivelab/trf-sg2im
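The abstract mentions a lower-dimensional space learned by a vector-quantized variational autoencoder. As a rough illustration of the quantization step only, the sketch below maps continuous latent vectors to the indices of their nearest codebook entries, which is the kind of discrete sequence a transformer can then model. It is a minimal sketch of generic vector quantization, not the authors' implementation: the function names `quantize` and `dequantize`, the 2-D codebook, and the latent values are all hypothetical.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry
    (nearest in squared Euclidean distance)."""
    indices = []
    for v in vectors:
        distances = [
            sum((a - b) ** 2 for a, b in zip(v, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices, codebook):
    """Look up the codebook vector for each discrete index."""
    return [codebook[i] for i in indices]


# Hypothetical codebook with four 2-D entries (a real model learns these).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical continuous latents, e.g. encoder outputs for three positions.
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, codebook)            # discrete token sequence
reconstructed = dequantize(tokens, codebook)    # quantized latents
print(tokens)  # [0, 3, 2]
```

In a full model the codebook would be learned jointly with an encoder and decoder, and the token indices would serve as the sequence over which the generative transformer operates; those parts are omitted here.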

