Multiscale contextual joint feature enhancement GAN for semantic image synthesis

Impact Factor 4.2 · Zone 3, Computer Science · Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-07-02 DOI:10.1016/j.imavis.2025.105637

Hengyou Wang, Rongxin Ma, Xiang Jiang

Image and Vision Computing, Volume 161, Article 105637. Full text: https://www.sciencedirect.com/science/article/pii/S0262885625002252

Citations: 0

Abstract

Semantic image synthesis aims to generate images conditioned on semantic segmentation maps. Existing methods typically employ a generative adversarial framework to combine latent variables with semantic segmentation maps. However, traditional convolutions and feature map complexity often lead to issues such as uneven color, unrealistic textures, and blurred edges in generated images. To address these issues, we propose the Multiscale Contextual Joint Feature Enhancement Generative Adversarial Network, called MSCJ-GAN. Specifically, to capture local details and enhance the global consistency of large-scale objects, a large receptive field feature enhancement module based on Fast Fourier Convolution (FFC) and Transformer is introduced. This module employs an attention mechanism in the frequency domain, enabling neurons in the early layers of the network to access contextual information from the entire image. Furthermore, to ensure clear and realistic textures for objects and their boundaries, a dual-dimensional feature enhancement module based on bias is proposed. This module fully utilizes the statistical features in the feature maps, channel differences, and the detailed expression of the bias matrix to improve the realism of the generated images. Finally, experimental results on three challenging datasets demonstrate that the proposed MSCJ-GAN outperforms state-of-the-art methods, achieving superior performance in generating large-scale objects (e.g., sky and grass) and intricate texture details (e.g., wrinkles and micro-expressions). The code will be released after this work is published: https://github.com/xinxin0312/MSCJ-GAN.
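The claim that a frequency-domain operation lets early-layer neurons see the whole image can be illustrated with a minimal NumPy sketch of an FFC-style spectral transform. This is not the MSCJ-GAN implementation, which the abstract does not specify: the function name `spectral_transform`, the random `weight` matrix, and the tensor sizes are all assumptions made for illustration, and the Transformer attention component is omitted. The sketch only shows that a pointwise (1x1) channel mixing applied to Fourier coefficients makes each output pixel depend on every input pixel.

```python
import numpy as np

def spectral_transform(x, weight):
    """Hypothetical FFC-style spectral transform (illustrative only).

    x      : real array of shape (C, H, W)
    weight : real array of shape (2 * C_out, 2 * C), a 1x1 "convolution"
             that mixes the stacked real and imaginary channels
    """
    c, h, w = x.shape
    # Real 2-D FFT over the spatial axes: shape (C, H, W // 2 + 1), complex.
    freq = np.fft.rfft2(x, norm="ortho")
    # Stack real and imaginary parts along the channel axis: (2C, H, W // 2 + 1).
    stacked = np.concatenate([freq.real, freq.imag], axis=0)
    # Pointwise channel mixing in the frequency domain, followed by ReLU.
    mixed = np.einsum("oc,chw->ohw", weight, stacked)
    mixed = np.maximum(mixed, 0.0)
    # Split back into real and imaginary halves and invert the FFT.
    c_out = weight.shape[0] // 2
    out_freq = mixed[:c_out] + 1j * mixed[c_out:]
    return np.fft.irfft2(out_freq, s=(h, w), norm="ortho")

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))          # assumed toy feature map
weight = rng.standard_normal((8, 8)) * 0.5    # assumed random mixing weights
y = spectral_transform(x, weight)

# Perturb a single input pixel and measure how many output pixels change.
x_perturbed = x.copy()
x_perturbed[0, 0, 0] += 1.0
y_perturbed = spectral_transform(x_perturbed, weight)
fraction_changed = (np.abs(y_perturbed - y) > 1e-9).mean()
print(f"fraction of output pixels affected: {fraction_changed:.2f}")
```

A single-pixel perturbation has a flat spectrum, so it touches every Fourier coefficient, and the inverse transform spreads the effect across the output. A 3x3 spatial convolution, by contrast, would change only a 3x3 neighbourhood per output channel.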




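Source journal

Image and Vision Computing (Engineering Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months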

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.