Dual-function discriminator for semantic image synthesis in variational GANs

IF 7.5 · CAS Region 1 (Computer Science) · JCR Q1, Computer Science, Artificial Intelligence
Aihua Ke, Bo Cai, Yujie Huang, Jian Luo, Yaoxiang Yu, Le Li
{"title":"Dual-function discriminator for semantic image synthesis in variational GANs","authors":"Aihua Ke ,&nbsp;Bo Cai ,&nbsp;Yujie Huang ,&nbsp;Jian Luo ,&nbsp;Yaoxiang Yu ,&nbsp;Le Li","doi":"10.1016/j.patcog.2025.111684","DOIUrl":null,"url":null,"abstract":"<div><div>Semantic image synthesis aims to generate target images conditioned on given semantic labels, but existing methods often struggle with maintaining high visual quality and accurate semantic alignment. To address these challenges, we propose VD-GAN, a novel framework that integrates advanced architectural and functional innovations. Our variational generator, built on an enhanced U-Net architecture combining a pre-trained Swin transformer and CNN, captures both global and local semantic features, generating high-quality images. To further boost performance, we design two innovative modules: the Conditional Residual Attention Module (CRAM) for dimensionality reduction modulation and the Channel and Spatial Attention Mechanism (CSAM) for extracting key semantic relationships across channel and spatial dimensions. Additionally, we introduce a dual-function discriminator that not only distinguishes real and synthesized images, but also performs multi-class segmentation on synthesized images, guided by a redefined class-balanced cross-entropy loss to ensure semantic consistency. Extensive experiments show that VD-GAN outperforms the latest supervised methods, with improvements of (FID, mIoU, Acc) by (5.40%, 4.37%, 1.48%) and increases in auxiliary metrics (LPIPS, TOPIQ) by (2.45%, 23.52%). The code will be available at <span><span>https://github.com/ah-ke/VD-GAN.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111684"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325003449","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Semantic image synthesis aims to generate target images conditioned on given semantic labels, but existing methods often struggle with maintaining high visual quality and accurate semantic alignment. To address these challenges, we propose VD-GAN, a novel framework that integrates advanced architectural and functional innovations. Our variational generator, built on an enhanced U-Net architecture combining a pre-trained Swin transformer and CNN, captures both global and local semantic features, generating high-quality images. To further boost performance, we design two innovative modules: the Conditional Residual Attention Module (CRAM) for dimensionality reduction modulation and the Channel and Spatial Attention Mechanism (CSAM) for extracting key semantic relationships across channel and spatial dimensions. Additionally, we introduce a dual-function discriminator that not only distinguishes real and synthesized images, but also performs multi-class segmentation on synthesized images, guided by a redefined class-balanced cross-entropy loss to ensure semantic consistency. Extensive experiments show that VD-GAN outperforms the latest supervised methods, with improvements of (FID, mIoU, Acc) by (5.40%, 4.37%, 1.48%) and increases in auxiliary metrics (LPIPS, TOPIQ) by (2.45%, 23.52%). The code will be available at https://github.com/ah-ke/VD-GAN.git.
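For readers who want a concrete picture of the dual-function idea, below is a minimal PyTorch-style sketch. It is not the authors' released implementation: the names (DualFunctionDiscriminator, class_balanced_ce), the backbone depth, the layer sizes, and the inverse-frequency weighting scheme are all illustrative assumptions. It shows a shared convolutional backbone with two heads, one producing patch-level real/fake logits and one producing per-pixel class logits that can be supervised with a class-balanced cross-entropy.

```python
# Hypothetical sketch of a dual-function discriminator; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualFunctionDiscriminator(nn.Module):
    """Shared conv backbone with two heads: patch real/fake logits and per-pixel class logits."""

    def __init__(self, num_classes, in_channels=3, base=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        )
        self.adv_head = nn.Conv2d(base * 4, 1, 3, padding=1)            # real vs. synthesized
        self.seg_head = nn.Conv2d(base * 4, num_classes, 3, padding=1)  # multi-class segmentation

    def forward(self, x):
        feat = self.backbone(x)
        adv_logits = self.adv_head(feat)
        # Upsample segmentation logits back to input resolution for pixel-wise supervision.
        seg_logits = F.interpolate(self.seg_head(feat), size=x.shape[2:],
                                   mode="bilinear", align_corners=False)
        return adv_logits, seg_logits


def class_balanced_ce(seg_logits, label_map, eps=1e-6):
    """Cross-entropy with per-class weights inversely proportional to pixel frequency
    in the current batch -- one plausible reading of a 'class-balanced' CE loss."""
    num_classes = seg_logits.shape[1]
    counts = torch.bincount(label_map.flatten(), minlength=num_classes).float()
    weights = counts.sum() / (num_classes * (counts + eps))   # rare classes weighted up
    weights = weights / weights.mean()
    return F.cross_entropy(seg_logits, label_map, weight=weights.to(seg_logits.device))


# Shapes-only usage example:
#   img = torch.randn(2, 3, 256, 256); labels = torch.randint(0, 35, (2, 256, 256))
#   adv, seg = DualFunctionDiscriminator(num_classes=35)(img)
#   loss = class_balanced_ce(seg, labels)
```

Sharing the backbone between the adversarial and segmentation heads is one plausible way to make the discriminator's feedback semantically grounded, in line with the abstract's stated goal of semantic consistency; the paper's actual formulation should be taken from the linked repository.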
Source Journal

Pattern Recognition (Engineering & Technology – Electrical & Electronic Engineering)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles per year: 683
Review time: 5.6 months
Journal description: The field of pattern recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas such as biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.