Generative Prompt Controlled Diffusion for weakly supervised semantic segmentation

Impact Factor 5.5 · CAS Zone 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
{"title":"Generative Prompt Controlled Diffusion for weakly supervised semantic segmentation","authors":"Wangyu Wu ,&nbsp;Tianhong Dai ,&nbsp;Zhenhong Chen ,&nbsp;Xiaowei Huang ,&nbsp;Fei Ma ,&nbsp;Jimin Xiao","doi":"10.1016/j.neucom.2025.130103","DOIUrl":null,"url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS), aiming to train segmentation models solely using image-level labels, has received significant attention. Existing approaches mainly concentrate on creating high-quality pseudo labels by utilizing existing images and their corresponding image-level labels. However, <strong><em>a major challenge</em></strong> arises when the available dataset is limited, as the quality of pseudo labels degrades significantly. In this paper, we tackle this challenge from a different perspective by introducing a novel approach called <em>Generative Prompt Controlled Diffusion</em> (GPCD) for data augmentation. This approach enhances the current labeled datasets by augmenting them with a variety of images, achieved through controlled diffusion guided by Generative Pre-trained Transformer (GPT) prompts. In this process, the existing images and image-level labels provide the necessary control information, while GPT enriches the prompts to generate diverse backgrounds. Moreover, we make an <strong><em>original contribution</em></strong> by integrating data source information as tokens into the Vision Transformer (ViT) framework, which improves the ability of downstream WSSS models to recognize the origins of augmented images. Our proposed GPCD approach clearly surpasses existing state-of-the-art methods, with its advantages being more pronounced when the available data is scarce, thereby demonstrating the effectiveness of our method. Our source code will be released.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"638 ","pages":"Article 130103"},"PeriodicalIF":5.5000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225007751","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Weakly supervised semantic segmentation (WSSS), which aims to train segmentation models using only image-level labels, has received significant attention. Existing approaches mainly concentrate on creating high-quality pseudo labels from existing images and their corresponding image-level labels. However, a major challenge arises when the available dataset is limited, as the quality of pseudo labels degrades significantly. In this paper, we tackle this challenge from a different perspective by introducing a novel data augmentation approach called Generative Prompt Controlled Diffusion (GPCD). This approach enriches the existing labeled dataset with a variety of generated images, produced through controlled diffusion guided by Generative Pre-trained Transformer (GPT) prompts. In this process, the existing images and image-level labels provide the necessary control information, while GPT enriches the prompts to generate diverse backgrounds. Moreover, we make an original contribution by integrating data-source information as tokens into the Vision Transformer (ViT) framework, which improves the ability of downstream WSSS models to recognize the origins of augmented images. Our proposed GPCD approach clearly surpasses existing state-of-the-art methods, and its advantage is more pronounced when the available data is scarce, demonstrating the effectiveness of our method. Our source code will be released.
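The augmentation pipeline described in the abstract (controlled diffusion conditioned on an existing labeled image, with GPT-enriched text prompts varying the background) can be approximated with off-the-shelf tools. Below is a minimal sketch using the Hugging Face `diffusers` ControlNet pipeline; the model identifiers, the Canny-edge control signal, and the `enrich_prompts` helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of prompt-controlled diffusion augmentation (not the paper's code).
# Assumes the `diffusers`, `opencv-python`, and `Pillow` packages and a CUDA GPU;
# model IDs and the prompt-enrichment helper are placeholders.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline


def enrich_prompts(class_name: str) -> list[str]:
    """Placeholder for GPT-based prompt enrichment: in GPCD, an LLM proposes
    diverse background/context descriptions for the labeled class."""
    return [
        f"a photo of a {class_name} on a city street, daytime",
        f"a photo of a {class_name} in a grassy field, overcast sky",
        f"a photo of a {class_name} indoors, cluttered background",
    ]


def augment(image_path: str, class_name: str, out_prefix: str) -> None:
    # Structural control signal derived from the existing labeled image
    # (Canny edges are one common choice of ControlNet conditioning).
    src = np.array(Image.open(image_path).convert("RGB"))
    gray = cv2.cvtColor(src, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Each enriched prompt yields a new image that keeps the object layout
    # (via the control image) while varying the background.
    for i, prompt in enumerate(enrich_prompts(class_name)):
        out = pipe(prompt, image=control, num_inference_steps=30).images[0]
        out.save(f"{out_prefix}_{i}.png")
```

Because the control image is derived from an image whose image-level label is already known, each generated sample can inherit that label, which is what makes the augmented set usable for WSSS pseudo-label training.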
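The abstract's second contribution, injecting data-source information as tokens into the ViT framework, can be illustrated with a small PyTorch module that prepends a learnable source embedding (e.g. original vs. diffusion-generated) to the patch-token sequence. The class name, dimensions, and encoder configuration below are assumptions made for a self-contained sketch, not the paper's architecture.

```python
# Sketch of a ViT-style encoder with an extra "data source" token
# (0 = original image, 1 = generated image). Illustrative only.
import torch
import torch.nn as nn


class SourceTokenViT(nn.Module):
    def __init__(self, embed_dim: int = 768, num_sources: int = 2,
                 num_patches: int = 196, depth: int = 12, num_heads: int = 12):
        super().__init__()
        # 16x16 patch embedding for 224x224 inputs -> 196 patch tokens.
        self.patch_proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        # One learnable embedding per data source.
        self.source_embed = nn.Embedding(num_sources, embed_dim)
        # +2 positions: class token and source token.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 2, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, images: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224); source_id: (B,) integer source labels.
        b = images.size(0)
        patches = self.patch_proj(images).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)                        # (B, 1, D)
        src = self.source_embed(source_id).unsqueeze(1)               # (B, 1, D)
        tokens = torch.cat([cls, src, patches], dim=1) + self.pos_embed
        return self.encoder(tokens)[:, 0]  # class-token feature
```

As a usage example, `SourceTokenViT()(torch.randn(2, 3, 224, 224), torch.tensor([0, 1]))` encodes a batch containing one original and one generated image, each tagged with its source token so downstream layers can account for the image's origin.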
Source journal
Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Annual articles: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.