ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy

IF 4.2 | CAS Tier 3 (Computer Science) | Q2 (Computer Science, Artificial Intelligence)
Jiaguang Li, Ying Wei, Wei Zhang, Chuyuan Wang
{"title":"ESDA:基于嵌入语义空间分布调整策略的零射击语义分割","authors":"Jiaguang Li,&nbsp;Ying Wei,&nbsp;Wei Zhang,&nbsp;Chuyuan Wang","doi":"10.1016/j.imavis.2025.105456","DOIUrl":null,"url":null,"abstract":"<div><div>Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Besides, we design spatial distribution losses to guide the update of parameters of the trainable blocks, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which can obtain richer semantic support through the interaction between the entire text embedding and image embedding. It can also further enhance the semantic support and strengthen the image embedding. Plenty of experiments on PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span> prove the effectiveness of our method. Our method achieves new state-of-the-art for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Codes are available at <span><span>https://github.com/Jiaguang-NEU/ESDA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105456"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy\",\"authors\":\"Jiaguang Li,&nbsp;Ying Wei,&nbsp;Wei Zhang,&nbsp;Chuyuan Wang\",\"doi\":\"10.1016/j.imavis.2025.105456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. 
In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Besides, we design spatial distribution losses to guide the update of parameters of the trainable blocks, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which can obtain richer semantic support through the interaction between the entire text embedding and image embedding. It can also further enhance the semantic support and strengthen the image embedding. Plenty of experiments on PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span> prove the effectiveness of our method. Our method achieves new state-of-the-art for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Codes are available at <span><span>https://github.com/Jiaguang-NEU/ESDA</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"155 \",\"pages\":\"Article 105456\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625000447\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000447","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The CLIP model, pre-trained on large-scale vision-language data, has recently advanced zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results, because this dense prediction task requires not only a precise understanding of semantics but also a precise perception of the different regions within an image. CLIP, however, is trained on image-level vision-language data and therefore perceives pixel-level regions poorly. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to perceive both semantics and regions accurately. The method inserts additional trainable blocks into the CLIP image encoder, allowing it to perceive regions effectively without losing semantic understanding. We also design spatial distribution losses to guide the updates of the trainable blocks' parameters, further strengthening the regional characteristics of the pixel-level image embeddings. In addition, previous methods obtain semantic support only through the text [CLS] token, which is far from sufficient for dense prediction. We therefore design a vision-language embedding interactor that obtains richer semantic support through interaction between the entire text embedding and the image embedding, which further enhances the semantic support and strengthens the image embedding. Extensive experiments on PASCAL-5^i and COCO-20^i demonstrate the effectiveness of our method: it achieves a new state of the art for zero-shot semantic segmentation and surpasses many few-shot semantic segmentation methods. Code is available at https://github.com/Jiaguang-NEU/ESDA.
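To make the two architectural ideas in the abstract concrete, the sketch below shows, in plain PyTorch, one plausible reading of (a) a lightweight trainable block inserted after a frozen CLIP image-encoder layer and (b) a vision-language embedding interactor built from cross-attention between the full text embedding and the patch-level image embedding. This is a minimal illustration under those assumptions, not the authors' released implementation (see the repository linked above); all module and parameter names are invented for the example.

```python
# Minimal sketch of the two components described in the abstract.
# NOT the authors' code (see https://github.com/Jiaguang-NEU/ESDA);
# module names, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class TrainableBlock(nn.Module):
    """Small residual adapter inserted after a frozen CLIP transformer layer."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update: keeps the frozen semantic features,
        # adds a learned regional adjustment on top of them.
        return x + self.adapter(x)


class VisionLanguageInteractor(nn.Module):
    """Cross-attention between the entire text embedding and patch embeddings."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.img_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Image patches query the full text sequence (not only [CLS])
        # for richer semantic support...
        img_out, _ = self.img_from_txt(img_tokens, txt_tokens, txt_tokens)
        # ...and the text embedding is refined by image context in turn.
        txt_out, _ = self.txt_from_img(txt_tokens, img_tokens, img_tokens)
        return img_tokens + img_out, txt_tokens + txt_out


if __name__ == "__main__":
    dim = 512
    imgs = torch.randn(2, 196, dim)  # 14x14 patch embeddings from a frozen encoder
    txts = torch.randn(2, 77, dim)   # full text token embeddings
    block = TrainableBlock(dim)
    interactor = VisionLanguageInteractor(dim)
    img_refined, txt_refined = interactor(block(imgs), txts)
    print(img_refined.shape, txt_refined.shape)  # (2, 196, 512) (2, 77, 512)
```

In this reading, only the adapter and interactor parameters would be trained while the CLIP backbone stays frozen, which matches the abstract's claim of adding regional awareness without losing the pre-trained semantic understanding.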
Source journal
Image and Vision Computing (Engineering & Technology: Electrical & Electronic Engineering)
CiteScore: 8.50
Self-citation rate: 8.50%
Annual publications: 143
Review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.