Counterfactual learning and saliency augmentation for weakly supervised semantic segmentation

IF 4.2 · CAS Tier 3 (Computer Science) · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xiangfu Ding, Youjia Shao, Na Tian, Li Wang, Wencang Zhao
DOI: 10.1016/j.imavis.2025.105523
Journal: Image and Vision Computing, Volume 158, Article 105523
Published: 2025-03-31 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S0262885625001118
Citations: 0

Abstract

Counterfactual learning and saliency augmentation for weakly supervised semantic segmentation
Weakly supervised semantic segmentation based on image-level annotations has garnered widespread attention due to its annotation efficiency and scalability. Numerous studies use class activation maps generated by classification networks to produce pseudo-labels and then train segmentation models on them. However, these methods exhibit certain limitations: biased localization activations, co-occurrence with the background, and missing semantics of target objects. We re-examine these issues from a causal perspective and propose a framework for CounterFactual Learning and Saliency Augmentation (CFLSA) based on causal inference. CFLSA consists of a debiased causal chain and a positional causal chain. The debiased causal chain, through a counterfactual decoupling generation module, compels the model to focus on invariant target features while disregarding background features. It effectively eliminates spurious correlations between foreground objects and the background, and it alleviates the issues of biased activation and co-occurring pixels. Second, to enable the model to recognize more comprehensive semantic information, we introduce a saliency augmentation mechanism into the positional causal chain that dynamically perceives foreground objects and background information; this provides pixel-level feedback and improves segmentation performance. With the two chains working together, CFLSA achieves advanced results on the PASCAL VOC 2012 and MS COCO 2014 datasets.
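The pseudo-labeling pipeline that the abstract builds on, generating class activation maps (CAMs) from a classifier and thresholding them into segmentation masks, can be sketched as follows. This is a minimal illustration of the standard CAM technique, not the paper's CFLSA method; the function names, the NumPy-based formulation, and the threshold value are all assumptions for the example.

```python
import numpy as np

def class_activation_map(features, weights, cls):
    """Compute a CAM for one class.

    features: (C, H, W) convolutional feature maps from the backbone.
    weights:  (K, C) weights of the final linear classifier (K classes).
    Returns a (H, W) map normalized to [0, 1].
    """
    # Weighted sum of feature channels using the classifier weights of `cls`.
    cam = np.tensordot(weights[cls], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)          # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1]
    return cam

def pseudo_label(cam, threshold=0.3):
    # Pixels whose activation exceeds the threshold become foreground (1).
    return (cam >= threshold).astype(np.uint8)

# Toy example: 2 feature channels on a 2x2 grid, a single class.
feats = np.array([[[1.0, 0.0], [0.0, 0.0]],
                  [[0.0, 2.0], [0.0, 0.0]]])
w = np.array([[1.0, 1.0]])            # one class, two channels
cam = class_activation_map(feats, w, cls=0)
mask = pseudo_label(cam, threshold=0.3)
```

The mask produced this way is exactly the kind of pseudo-label whose biased activations and background co-occurrence the paper's causal chains aim to correct.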
Source journal: Image and Vision Computing (Engineering — Electrical & Electronic Engineering)
CiteScore: 8.50
Self-citation rate: 8.50%
Articles published per year: 143
Review time: 7.8 months
Aims and scope: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.