Local optimization cropping and boundary enhancement for end-to-end weakly-supervised segmentation network

IF 4.3, Zone 3 (Computer Science), JCR Q2 (Computer Science, Artificial Intelligence)

Computer Vision and Image Understanding, Pub Date: 2025-02-01, DOI: 10.1016/j.cviu.2024.104260

Weizheng Wang, Chao Zeng, Haonan Wang, Lei Zhou

Volume 251, Article 104260. Publisher page: https://www.sciencedirect.com/science/article/pii/S1077314224003412

Citations: 0

Abstract

In recent years, the performance of weakly-supervised semantic segmentation (WSSS) has improved significantly. WSSS usually uses image-level labels to generate Class Activation Maps (CAMs), from which pseudo-labels are produced, greatly reducing the cost of annotation. Because CNNs cannot fully identify object regions, researchers have turned to Vision Transformers (ViTs), which complement CNNs by better extracting global contextual information. However, ViTs also introduce the problem of over-smoothing. Great progress has been made on over-smoothing in recent years, yet two issues remain. First, the high-confidence regions of the network-generated CAM still contain areas irrelevant to the class. Second, CAM boundaries are inaccurate and include a small portion of background. Precise label boundaries are closely tied to good segmentation performance. To address the first issue, we propose a local optimized cropping module (LOC). By randomly cropping selected regions, we allow the local class tokens to be contrasted with the global class tokens, which improves the consistency between local and global representations. To address the second issue, we design a boundary enhancement module (BE) that uses an erasing strategy to re-train on the image, increasing the network's extraction of boundary information and greatly improving the accuracy of CAM boundaries, thereby enhancing the quality of the pseudo-labels. Experiments on the PASCAL VOC dataset show that our proposed LOC-BE Net outperforms multi-stage methods and is competitive with end-to-end methods. On PASCAL VOC, our method achieves a CAM mIoU of 74.2% and a segmentation mIoU of 73.1%. On COCO2014, it achieves a CAM mIoU of 43.8% and a segmentation mIoU of 43.4%. Our code is open source: https://github.com/whn786/LOC-BE/tree/main.
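The results above are reported as mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal pure-Python sketch for a single predicted label map and its ground truth. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to average only over classes that appear in the prediction or the ground truth are assumptions made for illustration; this is not taken from the authors' repository. Benchmark evaluations typically accumulate the counts over the whole dataset before averaging.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
    ignore_index: int = 255,  # assumed label for unannotated pixels
) -> float:
    """Mean IoU over the classes that occur in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue  # skip pixels without a ground-truth label
            if p == t:
                # Pixel belongs to both masks of class t.
                intersection[t] += 1
                union[t] += 1
            else:
                # Pixel belongs to the predicted mask of p and the true mask of t.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 label maps with two classes (0 and 1).
pred = [[0, 0, 1], [0, 1, 1]]
target = [[0, 1, 1], [0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the toy example, class 0 has IoU 2/3 and class 1 has IoU 3/4, so the mean is 17/24. The same computation applies whether the labels being scored are CAM-derived pseudo-labels or the final segmentation output.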

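Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days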

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems