{"title":"WS-SAM: Generalizing SAM to Weakly Supervised Object Detection With Category Label","authors":"Hao Wang;Tong Jia;Qilong Wang;Wangmeng Zuo","doi":"10.1109/TIP.2025.3581729","DOIUrl":null,"url":null,"abstract":"Building an effective object detector usually depends on large well-annotated training samples. While annotating such dataset is extremely laborious and costly, where box-level supervision which contains both accurate classification category and localization coordinate is required. Compared to above box-level supervised annotation, those weakly supervised learning manners (e.g,, category, point and scribble) need relatively less laborious annotation cost, and provide a feasible way to mitigate the reliance on the dataset. Because of the lack of sufficient supervised information, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, Segment Anything Model (SAM) has appeared as a task-agnostic foundation model and shown promising performance improvement in many related works due to its powerful generalization and data processing abilities. The properties of the SAM inspire us to adopt such basic benchmark to weakly supervised object detection field to compensate the deficiencies in supervised information. However, directly deploying SAM on weakly supervised object detection task meets with two issues. Firstly, SAM needs meticulously-designed prompts, and such expert-level prompts restrict their applicability and practicality. Besides, SAM is a category unawareness model, and it cannot assign the category labels to the generated predictions. To solve above issues, we propose WS-SAM, which generalizes Segment Anything Model (SAM) to weakly supervised object detection with category label. Specifically, we design an adaptive prompt generator to take full advantages of the spatial and semantic information from the prompt. It employs in a self-prompting manner by taking the output of SAM from the previous iteration as the prompt input to guide the next iteration, where the prompts can be adaptively generated based on the classification activation map. We also develop a segmentation mask refinement module and formulate the label assignment process as a shortest path optimization problem by considering the similarity between each location and prompts. Furthermore, a bidirectional adapter is also implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate the effectiveness of our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experiment results show that our proposed method can achieve clear improvement over state-of-the-art methods, while performing favorably against state-of-the-arts.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4052-4066"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11053233/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
Building an effective object detector usually depends on a large, well-annotated training set, yet annotating such a dataset is extremely laborious and costly because box-level supervision, which provides both an accurate classification category and localization coordinates, is required. Compared with box-level annotation, weakly supervised labels (e.g., category, point, and scribble annotations) require far less annotation effort and offer a feasible way to mitigate this reliance on densely annotated data. However, because the available supervision is limited, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, the Segment Anything Model (SAM) has emerged as a task-agnostic foundation model and has delivered promising improvements in many related tasks owing to its strong generalization and data-processing abilities. These properties motivate us to bring such a foundation model to weakly supervised object detection to compensate for the missing supervision. Directly deploying SAM on this task, however, raises two issues. First, SAM requires meticulously designed prompts, and the need for such expert-level prompts restricts its applicability and practicality. Second, SAM is category-agnostic and cannot assign category labels to its predictions. To address these issues, we propose WS-SAM, which generalizes the Segment Anything Model (SAM) to weakly supervised object detection with category labels. Specifically, we design an adaptive prompt generator that fully exploits the spatial and semantic information carried by the prompt. It operates in a self-prompting manner, taking SAM's output from the previous iteration as the prompt input that guides the next iteration, so that prompts are adaptively generated from the classification activation map. We also develop a segmentation mask refinement module and formulate label assignment as a shortest-path optimization problem based on the similarity between each location and the prompts. Furthermore, a bidirectional adapter is implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experimental results show that it achieves clear improvements over, and performs favorably against, state-of-the-art methods.
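To make the self-prompting idea in the abstract concrete, the minimal sketch below shows one possible iterative loop: point prompts are picked from a classification activation map (CAM), fed to SAM, and SAM's output mask is used to gate the CAM before prompts are regenerated for the next iteration. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the function names `compute_cam` and `sam_predict`, and all hyper-parameters, are hypothetical placeholders.

```python
# Illustrative sketch of the self-prompting loop described in the abstract.
# `compute_cam` and `sam_predict` are hypothetical callables standing in for a
# CAM-producing classifier and a SAM predictor; they are NOT the paper's API.
from typing import Callable
import numpy as np

CamFn = Callable[[np.ndarray, int], np.ndarray]          # (image, class_id) -> HxW CAM in [0, 1]
SamFn = Callable[[np.ndarray, np.ndarray], np.ndarray]   # (image, Nx2 point prompts) -> HxW binary mask


def peaks_to_points(cam: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Pick the top-k highest-activation locations as (x, y) point prompts."""
    flat = np.argsort(cam.ravel())[::-1][:top_k]
    ys, xs = np.unravel_index(flat, cam.shape)
    return np.stack([xs, ys], axis=1)


def self_prompting_loop(image: np.ndarray, class_id: int,
                        compute_cam: CamFn, sam_predict: SamFn,
                        num_iters: int = 3) -> np.ndarray:
    """Iteratively refine the mask: SAM's previous output gates the CAM
    before the next round of prompts is generated."""
    cam = compute_cam(image, class_id)
    mask = np.ones_like(cam)                    # first iteration uses the raw CAM
    for _ in range(num_iters):
        gated_cam = cam * mask                  # keep activations inside the current mask
        points = peaks_to_points(gated_cam)     # adaptive prompts from the (gated) CAM
        mask = sam_predict(image, points).astype(cam.dtype)
    return mask
```

The loop only illustrates the prompt-then-segment iteration; the paper's full pipeline additionally refines the segmentation masks, assigns category labels via the shortest-path formulation, and adapts the model across domains with the bidirectional adapter.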