Xiangfu Ding, Youjia Shao, Na Tian, Li Wang, Wencang Zhao
Title: Counterfactual learning and saliency augmentation for weakly supervised semantic segmentation
Journal: Image and Vision Computing, Volume 158, Article 105523 (Q2, Computer Science, Artificial Intelligence; Impact Factor 4.2)
DOI: 10.1016/j.imavis.2025.105523
Publication date: 2025-03-31
URL: https://www.sciencedirect.com/science/article/pii/S0262885625001118
Citations: 0
Abstract
Weakly supervised semantic segmentation based on image-level annotations has garnered widespread attention for its annotation efficiency and scalability. Numerous studies use class activation maps generated by classification networks to produce pseudo-labels and then train segmentation models on them. However, these methods exhibit certain limitations: biased localization activations, co-occurrence between foreground objects and the background, and missing semantics of target objects. We re-examine these issues from a causal perspective and propose a framework for CounterFactual Learning and Saliency Augmentation (CFLSA) based on causal inference. CFLSA consists of a debiased causal chain and a positional causal chain. First, the debiased causal chain, through a counterfactual decoupling generation module, compels the model to focus on invariant target features while disregarding background features. It effectively eliminates spurious correlations between foreground objects and the background, alleviating both biased activations and co-occurring pixels. Second, to enable the model to recognize more comprehensive semantic information, we introduce a saliency augmentation mechanism in the positional causal chain that dynamically perceives foreground objects and background information, providing pixel-level feedback and improving segmentation performance. With the two chains working together, CFLSA achieves strong results on the PASCAL VOC 2012 and MS COCO 2014 datasets.
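As background for the CAM-based pseudo-label pipeline the abstract builds on, the sketch below illustrates the standard class activation map technique (weighting a classifier's final convolutional features by the fully connected weights of a target class, then thresholding into a pseudo segmentation label). This is a minimal generic illustration, not the paper's CFLSA method; the function names, toy shapes, and the foreground threshold value are assumptions.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a class activation map: weight each channel of the final
    conv feature map by the classifier weight for the target class and
    sum over channels, keeping only positive (class-supporting) evidence."""
    # features: (C, H, W) feature map; fc_weights: (num_classes, C)
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # -> (H, W)
    cam = np.maximum(cam, 0.0)          # discard negative evidence
    if cam.max() > 0:
        cam = cam / cam.max()           # normalize to [0, 1]
    return cam

def cam_to_pseudo_label(cam, class_idx, fg_thresh=0.4):
    """Threshold a normalized CAM into a pseudo segmentation label:
    pixels at or above fg_thresh get the class id, the rest background (0).
    fg_thresh=0.4 is an assumed illustrative value, not from the paper."""
    return np.where(cam >= fg_thresh, class_idx, 0)

# Toy example: a 3-channel 4x4 feature map and a 2-class classifier head.
rng = np.random.default_rng(0)
features = rng.random((3, 4, 4))
fc_weights = rng.random((2, 3))
cam = class_activation_map(features, fc_weights, class_idx=1)
label = cam_to_pseudo_label(cam, class_idx=1)
```

Pseudo-labels produced this way tend to cover only the most discriminative object parts and to leak onto co-occurring background, which is exactly the failure mode the abstract's counterfactual decoupling and saliency augmentation aim to address.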
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.