CLIP-Driven Transformer for Weakly Supervised Object Localization
Zhiwei Chen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 6, pp. 4878-4896
DOI: 10.1109/TPAMI.2025.3548704
Published: 2025-03-14
https://ieeexplore.ieee.org/document/10927651/
Abstract
Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Although recent advances incorporating transformers into WSOL have brought improvements, these methods often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel CLIP-Driven TRansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we first propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) to enrich object representations by integrating semantic-specific image and text representations into self-attention maps. Extensive experimental evaluations on benchmark datasets, such as CUB-200-2011 and ILSVRC, highlight the superior performance of our CDTR framework.
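To make the core idea of category-aware attention concrete, below is a minimal PyTorch-style sketch of a self-attention block whose attention logits are modulated by a learnable per-category bias, in the spirit of the CSM described above. The class name `CategoryBiasedAttention`, the tensor shapes, and the simple additive fusion are illustrative assumptions, not the authors' actual CDTR implementation.

```python
# Illustrative sketch only: a learnable per-category bias injected into
# self-attention logits, loosely following the category-aware idea in the
# abstract. All names and design details here are assumptions.
import torch
import torch.nn as nn


class CategoryBiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_categories: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per (category, head); broadcast over token pairs.
        self.category_bias = nn.Parameter(torch.zeros(num_categories, num_heads))

    def forward(self, x: torch.Tensor, category_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings; category_ids: (B,) image-level labels.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # raw attention logits
        # Inject the category-specific bias into the attention logits.
        bias = self.category_bias[category_ids]         # (B, heads)
        attn = attn + bias[:, :, None, None]
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    layer = CategoryBiasedAttention(dim=192, num_heads=3, num_categories=200)
    tokens = torch.randn(2, 197, 192)    # e.g. ViT-style patch + class tokens
    labels = torch.tensor([5, 42])       # image-level category labels
    print(layer(tokens, labels).shape)   # torch.Size([2, 197, 192])
```

In this toy version the image-level label selects a bias that shifts the attention logits uniformly per head; the paper's CSM, OCM, SKI, and SBA modules are considerably richer, in particular using CLIP text and image embeddings rather than a simple lookup table.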