{"title":"WS-SAM: Generalizing SAM to Weakly Supervised Object Detection With Category Label","authors":"Hao Wang;Tong Jia;Qilong Wang;Wangmeng Zuo","doi":"10.1109/TIP.2025.3581729","DOIUrl":null,"url":null,"abstract":"Building an effective object detector usually depends on large well-annotated training samples. While annotating such dataset is extremely laborious and costly, where box-level supervision which contains both accurate classification category and localization coordinate is required. Compared to above box-level supervised annotation, those weakly supervised learning manners (e.g,, category, point and scribble) need relatively less laborious annotation cost, and provide a feasible way to mitigate the reliance on the dataset. Because of the lack of sufficient supervised information, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, Segment Anything Model (SAM) has appeared as a task-agnostic foundation model and shown promising performance improvement in many related works due to its powerful generalization and data processing abilities. The properties of the SAM inspire us to adopt such basic benchmark to weakly supervised object detection field to compensate the deficiencies in supervised information. However, directly deploying SAM on weakly supervised object detection task meets with two issues. Firstly, SAM needs meticulously-designed prompts, and such expert-level prompts restrict their applicability and practicality. Besides, SAM is a category unawareness model, and it cannot assign the category labels to the generated predictions. To solve above issues, we propose WS-SAM, which generalizes Segment Anything Model (SAM) to weakly supervised object detection with category label. Specifically, we design an adaptive prompt generator to take full advantages of the spatial and semantic information from the prompt. It employs in a self-prompting manner by taking the output of SAM from the previous iteration as the prompt input to guide the next iteration, where the prompts can be adaptively generated based on the classification activation map. We also develop a segmentation mask refinement module and formulate the label assignment process as a shortest path optimization problem by considering the similarity between each location and prompts. Furthermore, a bidirectional adapter is also implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate the effectiveness of our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experiment results show that our proposed method can achieve clear improvement over state-of-the-art methods, while performing favorably against state-of-the-arts.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"4052-4066"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11053233/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
Building an effective object detector usually depends on a large, well-annotated training set, yet annotating such a dataset is extremely laborious and costly because box-level supervision, which provides both an accurate classification category and localization coordinates, is required. Compared with box-level annotation, weakly supervised labels (e.g., category, point, and scribble annotations) require far less annotation effort and offer a feasible way to mitigate this reliance on densely annotated data. However, because the available supervision is limited, current weakly supervised methods cannot achieve satisfactory detection performance. Recently, the Segment Anything Model (SAM) has emerged as a task-agnostic foundation model and has delivered promising improvements in many related tasks owing to its strong generalization and data-processing abilities. These properties motivate us to bring such a foundation model to weakly supervised object detection to compensate for the missing supervision. Directly deploying SAM on this task, however, raises two issues. First, SAM requires meticulously designed prompts, and the need for such expert-level prompts restricts its applicability and practicality. Second, SAM is category-agnostic and cannot assign category labels to its predictions. To address these issues, we propose WS-SAM, which generalizes the Segment Anything Model (SAM) to weakly supervised object detection with category labels. Specifically, we design an adaptive prompt generator that fully exploits the spatial and semantic information carried by the prompt. It operates in a self-prompting manner, taking SAM's output from the previous iteration as the prompt input that guides the next iteration, so that prompts are adaptively generated from the classification activation map. We also develop a segmentation mask refinement module and formulate label assignment as a shortest-path optimization problem based on the similarity between each location and the prompts. Furthermore, a bidirectional adapter is implemented to resolve the domain discrepancy by incorporating domain-specific information. We evaluate our method on several detection datasets (e.g., PASCAL VOC and MS COCO), and the experimental results show that it achieves clear improvements over, and performs favorably against, state-of-the-art methods.
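To make the self-prompting idea in the abstract concrete, the minimal sketch below shows one possible iterative loop: point prompts are picked from a classification activation map (CAM), fed to SAM, and SAM's output mask is used to gate the CAM before prompts are regenerated for the next iteration. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the function names `compute_cam` and `sam_predict`, and all hyper-parameters, are hypothetical placeholders.

```python
# Illustrative sketch of the self-prompting loop described in the abstract.
# `compute_cam` and `sam_predict` are hypothetical callables standing in for a
# CAM-producing classifier and a SAM predictor; they are NOT the paper's API.
from typing import Callable
import numpy as np

CamFn = Callable[[np.ndarray, int], np.ndarray]          # (image, class_id) -> HxW CAM in [0, 1]
SamFn = Callable[[np.ndarray, np.ndarray], np.ndarray]   # (image, Nx2 point prompts) -> HxW binary mask


def peaks_to_points(cam: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Pick the top-k highest-activation locations as (x, y) point prompts."""
    flat = np.argsort(cam.ravel())[::-1][:top_k]
    ys, xs = np.unravel_index(flat, cam.shape)
    return np.stack([xs, ys], axis=1)


def self_prompting_loop(image: np.ndarray, class_id: int,
                        compute_cam: CamFn, sam_predict: SamFn,
                        num_iters: int = 3) -> np.ndarray:
    """Iteratively refine the mask: SAM's previous output gates the CAM
    before the next round of prompts is generated."""
    cam = compute_cam(image, class_id)
    mask = np.ones_like(cam)                    # first iteration uses the raw CAM
    for _ in range(num_iters):
        gated_cam = cam * mask                  # keep activations inside the current mask
        points = peaks_to_points(gated_cam)     # adaptive prompts from the (gated) CAM
        mask = sam_predict(image, points).astype(cam.dtype)
    return mask
```

The loop only illustrates the prompt-then-segment iteration; the paper's full pipeline additionally refines the segmentation masks, assigns category labels via the shortest-path formulation, and adapts the model across domains with the bidirectional adapter.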