Optimizing multi-task network with learned prototypes for weakly supervised semantic segmentation

IF 2.7; Zone 3 (Engineering Technology); JCR Q2 (ENGINEERING, ELECTRICAL & ELECTRONIC)

Signal Processing-Image Communication. Pub Date: 2025-01-17. DOI: 10.1016/j.image.2025.117272

Lei Zhou, Jiasong Wang, Jing Luo, Yuheng Guo, Xiaoxiao Li

Signal Processing-Image Communication, volume 134, Article 117272. Journal Article. https://www.sciencedirect.com/science/article/pii/S0923596525000190

Citations: 0

Abstract

Weakly supervised semantic segmentation (WSSS) is the challenging task of segmenting semantic objects using only image-level labels as supervision. One common family of state-of-the-art solutions generates pseudo pixel-level annotations from localization maps. In most such solutions, however, the localization maps are incomplete, so the quality of the pseudo annotations may fall short of what semantic segmentation requires. To generate denser localization maps for WSSS, this paper proposes a prototype learning guided multi-task network. First, prototypes (also called prototypical feature vectors) are used to describe the similarities between images. Specifically, the information shared among different training images is exploited to jointly learn prototypes for both the foreground categories and the background. Evaluating the similarity between these representative prototypes and the extracted pixel features helps localize more reliable background pixels and foreground regions. In addition, the learned prototypes can be incorporated into the multi-task network to make parameter optimization more efficient by adaptively rectifying errors in the pixel-level supervision. The multi-task network can thus be optimized for object localization, and high-quality proxy annotations produced, using clean image-level labels together with refined pixel-level supervision. Selecting and refining the proxy annotations further improves the performance of the segmentation algorithm. Extensive experiments on two datasets, PASCAL VOC 2012 and COCO 2014, show that the proposed prototype learning guided multi-task network outperforms current state-of-the-art (SOTA) methods in segmentation performance, achieving a mean IoU of 72.1% and 72.6% on the PASCAL VOC 2012 validation and test sets, respectively.
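The abstract's idea of localizing pixels by comparing their features with class prototypes can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the function names `cosine_similarity` and `assign_to_prototypes`, and the 2-D example vectors are all assumptions made for illustration, and the prototypes here are fixed by hand rather than learned from training images.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_prototypes(pixel_features, prototypes):
    """Label each pixel feature with its most similar prototype.

    Returns a list of (label, similarity) pairs, one per pixel.
    """
    assignments = []
    for feature in pixel_features:
        scores = {
            name: cosine_similarity(feature, proto)
            for name, proto in prototypes.items()
        }
        best = max(scores, key=scores.get)
        assignments.append((best, scores[best]))
    return assignments


# Hypothetical 2-D prototypes and pixel features, chosen only for illustration.
prototypes = {"background": [1.0, 0.0], "cat": [0.0, 1.0]}
pixel_features = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

for label, score in assign_to_prototypes(pixel_features, prototypes):
    print(f"{label}: {score:.3f}")
```

In this toy setting, each pixel receives the label of its nearest prototype together with a similarity score. A score like this could plausibly serve as a confidence signal when deciding which pixel-level pseudo labels to trust, although the paper's actual rectification mechanism is not specified in the abstract.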


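Source journal

Signal Processing-Image Communication (Engineering Technology: Engineering, Electrical & Electronic)

CiteScore: 8.40

Self-citation rate: 2.90%

Articles published: 138

Review time: 5.2 months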

Journal description: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following: To present a forum for the advancement of theory and practice of image communication. To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems. To contribute to a rapid information exchange between the industrial and academic environments.

The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.

Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.