Style transfer with diffusion models for synthetic-to-real domain adaptation

Impact factor 3.5; CAS Zone 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence)

Computer Vision and Image Understanding. Pub Date: 2025-07-14. DOI: 10.1016/j.cviu.2025.104445

Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin

Volume 259, Article 104445. Publisher page: https://www.sciencedirect.com/science/article/pii/S1077314225001687


Abstract

Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet recent foundation models can generate realistic images without any training. This paper proposes leveraging such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTI_F). CACTI applies statistical normalization selectively based on semantic classes, while CACTI_F further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
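To make the idea of class-wise statistical normalization concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the repository above for that). It assumes feature maps of shape (C, H, W) and integer label maps of shape (H, W) for both the content (synthetic) and style (real) images. The function name `classwise_adain`, the `eps` parameter, and the choice to leave classes that are absent from the style image unchanged are all illustrative assumptions. For each semantic class, the content features inside that class's mask are standardized and then rescaled to the per-channel mean and standard deviation of the style features carrying the same label.

```python
import numpy as np

def classwise_adain(content, style, content_labels, style_labels, eps=1e-5):
    """Match per-class, per-channel statistics of `content` to those of `style`.

    content, style: float arrays of shape (C, H, W) (hypothetical feature maps)
    content_labels, style_labels: int arrays of shape (H, W) (semantic classes)
    """
    out = content.copy()
    for cls in np.unique(content_labels):
        c_mask = content_labels == cls
        s_mask = style_labels == cls
        if not s_mask.any():
            # Assumption: a class missing from the style image is left untouched.
            continue
        c_feat = content[:, c_mask]  # shape (C, number of content pixels in class)
        s_feat = style[:, s_mask]    # shape (C, number of style pixels in class)
        c_mean = c_feat.mean(axis=1, keepdims=True)
        c_std = c_feat.std(axis=1, keepdims=True)
        s_mean = s_feat.mean(axis=1, keepdims=True)
        s_std = s_feat.std(axis=1, keepdims=True)
        # Standardize within the class, then apply the style statistics.
        out[:, c_mask] = (c_feat - c_mean) / (c_std + eps) * s_std + s_mean
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    content = rng.normal(size=(4, 16, 16))
    style = rng.normal(loc=2.0, scale=3.0, size=(4, 16, 16))
    labels = np.zeros((16, 16), dtype=int)
    labels[:, 8:] = 1  # two toy classes: left half and right half
    stylized = classwise_adain(content, style, labels, labels)
    print(stylized.shape)
```

Normalizing inside each class mask, rather than over the whole feature map as in global AdaIN, keeps statistics from one class (for example sky) from leaking into another (for example road), which is the property the abstract attributes to CACTI.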


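Source journal

Computer Vision and Image Understanding (category: Engineering and Technology; Engineering, Electrical and Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days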

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems