Weakly supervised instance segmentation via class double-activation maps and boundary localization
Authors: Jin Peng, Yongxiong Wang, Zhiqun Pan
Journal: Signal Processing: Image Communication, vol. 127, Article 117150
Published: 2024-05-10
DOI: 10.1016/j.image.2024.117150
URL: https://www.sciencedirect.com/science/article/pii/S0923596524000511
Impact factor: 3.4 (JCR Q2, Engineering, Electrical & Electronic)
Citations: 0
Abstract
Weakly supervised instance segmentation based on image-level class labels has recently gained much attention. Its key step is generating pseudo labels from class activation maps (CAMs). Most methods train the classification model with binary cross-entropy (BCE) loss. However, because BCE loss does not make classes mutually exclusive, activations for different classes occur independently. As a result, not only are foreground classes wrongly activated as background, but incorrect activations also occur among confusable classes in the foreground. To solve this problem, we propose the Class Double-Activation Map (Double-CAM). First, the vanilla CAM is extracted from the multi-label classifier and fused with the output feature map of the backbone. The enhanced feature map of each class is fed into a single-label classification branch with softmax cross-entropy (SCE) loss and an entropy minimization module, from which the more accurate Double-CAM is extracted. The Double-CAM refines the vanilla CAM and thereby improves the quality of the pseudo labels. Second, to mine object edge cues from the Double-CAM, we propose the Boundary Localization (BL) module, which synthesizes boundary annotations that constrain label propagation more explicitly without additional supervision. Adding the BL module also substantially improves the quality of the pseudo masks. Finally, the generated pseudo labels are used to train fully supervised instance segmentation networks. Evaluations on the VOC and COCO datasets show that our method outperforms mainstream weakly supervised segmentation methods at the same supervision level, and even methods that depend on stronger supervision.
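The contrast between BCE-style and SCE-style activations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the logits are hypothetical values for a single location, and the `entropy` helper only shows the quantity that an entropy minimization term would push down.

```python
import math


def sigmoid(z):
    """Per-class probability, as used with BCE loss (each class scored alone)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Mutually exclusive class probabilities, as used with softmax cross-entropy."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a distribution; lower means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits at one location: two confusable classes and one unrelated class.
logits = [2.0, 1.5, -3.0]

independent = [sigmoid(z) for z in logits]  # BCE-style: no competition between classes
exclusive = softmax(logits)                 # SCE-style: classes compete for probability mass

# Under sigmoid, both confusable classes exceed 0.5 at the same location.
both_active = independent[0] > 0.5 and independent[1] > 0.5

# Under softmax, the probabilities sum to 1, so only one class can dominate.
one_dominant = exclusive[0] > 0.5 and exclusive[1] < 0.5

# A sharper (hypothetical) logit vector has lower entropy, which is the
# direction an entropy minimization objective encourages.
sharper = softmax([4.0, 1.5, -3.0])
entropy_drops = entropy(sharper) < entropy(exclusive)

print(both_active, one_dominant, entropy_drops)
```

In this toy setting the sigmoid scores let two confusable classes fire together, while the softmax scores force a single winner, which matches the motivation the abstract gives for adding a single-label branch.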
About the journal:
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, and architectures for image/video processing and communication.