CLIP-Driven Transformer for Weakly Supervised Object Localization

Zhiwei Chen; Yunhang Shen; Liujuan Cao; Shengchuan Zhang; Rongrong Ji
{"title":"用于弱监督对象定位的 CLIP 驱动变换器","authors":"Zhiwei Chen;Yunhang Shen;Liujuan Cao;Shengchuan Zhang;Rongrong Ji","doi":"10.1109/TPAMI.2025.3548704","DOIUrl":null,"url":null,"abstract":"Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Despite recent advancements incorporating transformers into WSOL have resulted in improvements, these methods often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel <bold>C</b>LIP-<bold>D</b>riven <bold>TR</b>ansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we initially propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) to enrich object representations by integrating semantic-specific image and text representations into self-attention maps. Extensive experimental evaluations on benchmark datasets, such as <monospace>CUB-200-2011</monospace> and <monospace>ILSVRC</monospace> highlight the superior performance of our CDTR framework.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 6","pages":"4878-4896"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP-Driven Transformer for Weakly Supervised Object Localization\",\"authors\":\"Zhiwei Chen;Yunhang Shen;Liujuan Cao;Shengchuan Zhang;Rongrong Ji\",\"doi\":\"10.1109/TPAMI.2025.3548704\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Despite recent advancements incorporating transformers into WSOL have resulted in improvements, these methods often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel <bold>C</b>LIP-<bold>D</b>riven <bold>TR</b>ansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we initially propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) to enrich object representations by integrating semantic-specific image and text representations into self-attention maps. 
Extensive experimental evaluations on benchmark datasets, such as <monospace>CUB-200-2011</monospace> and <monospace>ILSVRC</monospace> highlight the superior performance of our CDTR framework.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 6\",\"pages\":\"4878-4896\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-03-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10927651/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10927651/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Although recent methods that incorporate transformers into WSOL have yielded improvements, they often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel CLIP-Driven TRansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we first propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) that enriches object representations by integrating semantic-specific image and text representations into self-attention maps. Extensive experimental evaluations on benchmark datasets such as CUB-200-2011 and ILSVRC highlight the superior performance of our CDTR framework.
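The abstract is high-level, so the following is a minimal, hypothetical PyTorch sketch of two of the ideas it names: a CSM-style attention block that embeds a learnable category bias into self-attention maps, and an SBA-style adapter that injects a CLIP-like text embedding into patch features. All module names, shapes, and the specific fusion scheme below are assumptions made for illustration, not the authors' implementation.

# Hypothetical sketch only: names, shapes, and fusion are assumptions, not CDTR itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CategoryAwareAttention(nn.Module):
    """CSM-style sketch: self-attention with a learnable per-class bias."""

    def __init__(self, dim: int, num_heads: int, num_classes: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable embedding per class; its alignment with each key
        # becomes an additive bias on the attention logits (an assumption
        # about what "embedding learnable category biases" could mean).
        self.category_embed = nn.Parameter(torch.zeros(num_classes, dim))

    def forward(self, x: torch.Tensor, class_idx: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        e = self.category_embed[class_idx]             # (B, dim), weak label only
        e = e.view(B, self.num_heads, -1)              # (B, heads, d)
        key_bias = (k * e[:, :, None, :]).sum(-1)      # (B, heads, N)
        attn = (attn + key_bias[:, :, None, :]).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class SemanticBoostAdapter(nn.Module):
    """SBA-style sketch: reweight patch tokens by a class-text embedding."""

    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)

    def forward(self, patches: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_emb)                   # (B, dim)
        # Emphasize patches that agree with the text semantics; one plausible
        # reading of fusing text representations into the attention pathway.
        sim = F.cosine_similarity(patches, t[:, None, :], dim=-1)  # (B, N)
        return patches * (1.0 + sim[..., None])


if __name__ == "__main__":
    x = torch.randn(2, 196, 384)        # 2 images, 14x14 ViT patch tokens
    labels = torch.tensor([3, 7])       # image-level labels (weak supervision)
    text_emb = torch.randn(2, 512)      # stand-in for CLIP text features
    attn = CategoryAwareAttention(dim=384, num_heads=6, num_classes=200)
    sba = SemanticBoostAdapter(dim=384, text_dim=512)
    y = sba(attn(x, labels), text_emb)
    print(y.shape)                      # torch.Size([2, 196, 384])

Biasing the attention logits through a per-class key alignment is only one plausible reading of "embedding learnable category biases into self-attention maps"; the actual CDTR modules may differ substantially, and the paper should be consulted for the real design.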