Fine-Grained Semantic Image Synthesis with Object-Attention Generative Adversarial Network

Min Wang, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Yutong Gao
{"title":"Fine-Grained Semantic Image Synthesis with Object-Attention Generative Adversarial Network","authors":"Min Wang, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Yutong Gao","doi":"10.1145/3470008","DOIUrl":null,"url":null,"abstract":"Semantic image synthesis is a new rising and challenging vision problem accompanied by the recent promising advances in generative adversarial networks. The existing semantic image synthesis methods only consider the global information provided by the semantic segmentation mask, such as class label, global layout, and location, so the generative models cannot capture the rich local fine-grained information of the images (e.g., object structure, contour, and texture). To address this issue, we adopt a multi-scale feature fusion algorithm to refine the generated images by learning the fine-grained information of the local objects. We propose OA-GAN, a novel object-attention generative adversarial network that allows attention-driven, multi-fusion refinement for fine-grained semantic image synthesis. Specifically, the proposed model first generates multi-scale global image features and local object features, respectively, then the local object features are fused into the global image features to improve the correlation between the local and the global. In the process of feature fusion, the global image features and the local object features are fused through the channel-spatial-wise fusion block to learn ‘what’ and ‘where’ to attend in the channel and spatial axes, respectively. The fused features are used to construct correlation filters to obtain feature response maps to determine the locations, contours, and textures of the objects. Extensive quantitative and qualitative experiments on COCO-Stuff, ADE20K and Cityscapes datasets demonstrate that our OA-GAN significantly outperforms the state-of-the-art methods.","PeriodicalId":123526,"journal":{"name":"ACM Transactions on Intelligent Systems and Technology (TIST)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Intelligent Systems and Technology (TIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3470008","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Semantic image synthesis is an emerging and challenging vision problem that has accompanied recent advances in generative adversarial networks. Existing semantic image synthesis methods consider only the global information provided by the semantic segmentation mask, such as class labels, global layout, and location, so the generative models cannot capture the rich, fine-grained local information of the images (e.g., object structure, contour, and texture). To address this issue, we adopt a multi-scale feature fusion algorithm that refines the generated images by learning the fine-grained information of local objects. We propose OA-GAN, a novel object-attention generative adversarial network that enables attention-driven, multi-scale fusion refinement for fine-grained semantic image synthesis. Specifically, the proposed model first generates multi-scale global image features and local object features; the local object features are then fused into the global image features to strengthen the correlation between the local and the global. During feature fusion, the global image features and the local object features are combined through a channel-spatial-wise fusion block that learns ‘what’ and ‘where’ to attend along the channel and spatial axes, respectively. The fused features are used to construct correlation filters that yield feature response maps, which determine the locations, contours, and textures of the objects. Extensive quantitative and qualitative experiments on the COCO-Stuff, ADE20K, and Cityscapes datasets demonstrate that OA-GAN significantly outperforms state-of-the-art methods.
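
Only the abstract is available here, so the sketch below is merely an illustration of the two mechanisms it names: a channel-spatial-wise fusion block that learns ‘what’ (channel attention) and ‘where’ (spatial attention) to attend, and a correlation filter applied to the fused features to produce a response map. All names (ChannelSpatialFusion, response_map, global_feat, object_feat) are hypothetical, and the attention layout follows the common channel-then-spatial pattern (as in CBAM); the authors' actual architecture may differ.

```python
# Minimal PyTorch sketch of the two mechanisms described in the abstract.
# Hypothetical names and structure; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSpatialFusion(nn.Module):
    """Fuse local object features into global image features, then apply
    channel attention ('what') followed by spatial attention ('where')."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, produce per-channel weights.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, global_feat: torch.Tensor,
                object_feat: torch.Tensor) -> torch.Tensor:
        # Resize local object features to the global scale and fuse.
        object_feat = F.interpolate(object_feat, size=global_feat.shape[-2:],
                                    mode='bilinear', align_corners=False)
        fused = global_feat + object_feat

        # Channel attention: 'what' to attend to.
        b, c, _, _ = fused.shape
        avg = fused.mean(dim=(2, 3))   # (B, C) average-pooled descriptor
        mx = fused.amax(dim=(2, 3))    # (B, C) max-pooled descriptor
        ch_w = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        fused = fused * ch_w.view(b, c, 1, 1)

        # Spatial attention: 'where' to attend to.
        sp = torch.cat([fused.mean(dim=1, keepdim=True),
                        fused.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        sp_w = torch.sigmoid(self.spatial_conv(sp))
        return fused * sp_w


def response_map(global_feat: torch.Tensor, filt: torch.Tensor) -> torch.Tensor:
    """Cross-correlate global features with a filter built from the fused
    object features; the response peak indicates the object's location."""
    # filt: (C, k, k) with odd k, so padding=k//2 preserves spatial size.
    return F.conv2d(global_feat, filt.unsqueeze(0), padding=filt.shape[-1] // 2)


if __name__ == "__main__":
    block = ChannelSpatialFusion(channels=64)
    g = torch.randn(2, 64, 32, 32)    # global image features
    o = torch.randn(2, 64, 16, 16)    # local object features, coarser scale
    fused = block(g, o)
    filt = fused[0, :, 12:19, 12:19]  # a 7x7 correlation filter from the fused map
    print(fused.shape, response_map(g, filt).shape)
    # torch.Size([2, 64, 32, 32]) torch.Size([2, 1, 32, 32])
```

How such a response map translates into object contours and textures (e.g., via thresholding or further convolutional decoding) is not specified in the abstract, so it is left out of the sketch.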