Diffusion Model is Secretly a Training-Free Open Vocabulary Semantic Segmenter

Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
{"title":"Diffusion Model is Secretly a Training-Free Open Vocabulary Semantic Segmenter","authors":"Jinglong Wang;Xiawei Li;Jing Zhang;Qingyuan Xu;Qin Zhou;Qian Yu;Lu Sheng;Dong Xu","doi":"10.1109/TIP.2025.3551648","DOIUrl":null,"url":null,"abstract":"The pre-trained text-image discriminative models, such as CLIP, has been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"1895-1907"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10938258/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Pre-trained text-image discriminative models such as CLIP have been explored for open-vocabulary semantic segmentation, with unsatisfactory results due to the loss of crucial localization information and a lack of awareness of object shapes. Recently, there has been growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches use generative models either to generate annotated data or to extract features that facilitate semantic segmentation, which typically involves producing a considerable amount of synthetic data or requires additional mask annotations. In contrast, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. Our insight is that, in order to generate realistic objects that are semantically faithful to the input text, a diffusion model must implicitly learn both complete object shapes and their corresponding semantics. We find that object shapes are characterized by the self-attention maps of the denoising U-Net, while semantics are indicated by its cross-attention maps; together these form the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
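The abstract's core mechanism can be illustrated with a minimal sketch: cross-attention maps score each spatial location against candidate class tokens (semantics), and self-attention maps describe affinities between locations (shape), so propagating class evidence along self-attention affinities completes partial activations into full object masks. The function name, array shapes, normalization, and thresholding below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def segmentation_from_attention(cross_attn, self_attn, threshold=0.5):
    """Hypothetical sketch of fusing denoising U-Net attention maps.

    cross_attn: (H*W, K) cross-attention scores, one column per
                candidate class token (carries semantics).
    self_attn:  (H*W, H*W) self-attention scores between spatial
                locations (carries object shape / grouping).
    Returns a (H*W,) array of class indices, with K as background.
    """
    # Propagate class evidence along self-attention affinities so
    # partial activations are completed to whole-object shapes.
    refined = self_attn @ cross_attn                      # (H*W, K)

    # Min-max normalize each class map to [0, 1] before thresholding
    # (one plausible choice; the paper may normalize differently).
    refined -= refined.min(axis=0, keepdims=True)
    refined /= refined.max(axis=0, keepdims=True) + 1e-8

    labels = refined.argmax(axis=1)
    # Assign low-confidence locations to a background label K.
    labels[refined.max(axis=1) < threshold] = cross_attn.shape[1]
    return labels
```

In practice the attention maps would be averaged over heads, layers, and denoising timesteps before fusion; this sketch assumes that aggregation has already happened.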