CLIP-Driven Transformer for Weakly Supervised Object Localization
Zhiwei Chen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 6, pp. 4878-4896
DOI: 10.1109/TPAMI.2025.3548704
Published: 2025-03-14
https://ieeexplore.ieee.org/document/10927651/
Abstract
Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Although recent advances incorporating transformers into WSOL have brought improvements, these methods often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel CLIP-Driven TRansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we first propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) to enrich object representations by integrating semantic-specific image and text representations into self-attention maps. Extensive experimental evaluations on benchmark datasets, such as CUB-200-2011 and ILSVRC, highlight the superior performance of our CDTR framework.
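To make the core idea of category-aware attention concrete, below is a minimal PyTorch-style sketch of a self-attention block whose attention logits are modulated by a learnable per-category bias, in the spirit of the CSM described above. The class name `CategoryBiasedAttention`, the tensor shapes, and the simple additive fusion are illustrative assumptions, not the authors' actual CDTR implementation.

```python
# Illustrative sketch only: a learnable per-category bias injected into
# self-attention logits, loosely following the category-aware idea in the
# abstract. All names and design details here are assumptions.
import torch
import torch.nn as nn


class CategoryBiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_categories: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per (category, head); broadcast over token pairs.
        self.category_bias = nn.Parameter(torch.zeros(num_categories, num_heads))

    def forward(self, x: torch.Tensor, category_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings; category_ids: (B,) image-level labels.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # raw attention logits
        # Inject the category-specific bias into the attention logits.
        bias = self.category_bias[category_ids]         # (B, heads)
        attn = attn + bias[:, :, None, None]
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    layer = CategoryBiasedAttention(dim=192, num_heads=3, num_categories=200)
    tokens = torch.randn(2, 197, 192)    # e.g. ViT-style patch + class tokens
    labels = torch.tensor([5, 42])       # image-level category labels
    print(layer(tokens, labels).shape)   # torch.Size([2, 197, 192])
```

In this toy version the image-level label selects a bias that shifts the attention logits uniformly per head; the paper's CSM, OCM, SKI, and SBA modules are considerably richer, in particular using CLIP text and image embeddings rather than a simple lookup table.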