Local Cross-Patch Activation From Multi-Direction for Weakly Supervised Object Localization

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-31. DOI: 10.1109/TIP.2025.3554398

Pei Lv; Junying Ren; Genwang Han; Jiwen Lu; Mingliang Xu

Volume 34, pages 2213-2227. Publication type: Journal Article. IEEE Xplore: https://ieeexplore.ieee.org/document/10945987/


Abstract

Weakly supervised object localization (WSOL) learns to localize objects using only image-level labels. Recent studies apply transformers to WSOL to capture long-range feature dependencies and alleviate the partial activation problem of CNN-based methods. However, existing transformer-based methods still face two challenges. The first is over-activation of the background: object boundaries and background are often semantically similar, so a localization model may misidentify background as part of the object. The second is incomplete activation of occluded objects: because the transformer architecture ignores semantic and spatial coherence, it has difficulty capturing local features across patches. To address these issues, we propose LCA-MD, a novel transformer-based WSOL method using local cross-patch activation from multi-direction, which captures more local feature detail while inhibiting background over-activation. First, combining contrastive learning with the transformer, we propose a token feature contrast module (TCM) that maximizes the difference between foreground and background so that the two can be separated more accurately. Second, we propose a semantic-spatial fusion module (SFM), which leverages multi-directional perception to capture local cross-patch features and diffuse activation across occlusions. Experimental results on the CUB-200-2011 and ILSVRC datasets demonstrate that LCA-MD achieves state-of-the-art results in WSOL. The project code is available at https://github.com/rjy-fighting/LCA-MD.
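The abstract names two ideas without giving formulas: contrasting foreground and background tokens, and spreading activation to neighbouring patches from several directions. The toy sketch below illustrates both in plain Python. It is not the authors' TCM or SFM. The function names (`fg_bg_contrast`, `diffuse`), the hard threshold split, the cosine measure, and the 8-neighbour averaging are all assumptions chosen only to make the ideas concrete. The actual method is in the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fg_bg_contrast(tokens, scores, threshold=0.5):
    """Toy contrast term (hypothetical): cosine similarity between the mean
    foreground token and the mean background token. A lower value means the
    two groups are better separated. The threshold split is an assumption."""
    fg = [t for t, s in zip(tokens, scores) if s >= threshold]
    bg = [t for t, s in zip(tokens, scores) if s < threshold]
    if not fg or not bg:
        return 0.0  # nothing to contrast
    return cosine(mean_vector(fg), mean_vector(bg))


# Eight neighbour offsets (row, col): the "multi-direction" part of the toy.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def diffuse(act, steps=1):
    """Toy diffusion (hypothetical): raise each patch to at least the mean of
    its in-bounds neighbours, so activation can leak across a low-scoring gap
    such as an occluded patch between two activated ones."""
    h, w = len(act), len(act[0])
    for _ in range(steps):
        new = [row[:] for row in act]
        for r in range(h):
            for c in range(w):
                vals = [act[r + dr][c + dc] for dr, dc in DIRECTIONS
                        if 0 <= r + dr < h and 0 <= c + dc < w]
                if vals:
                    new[r][c] = max(act[r][c], sum(vals) / len(vals))
        act = new
    return act
```

In this toy, a training objective would push `fg_bg_contrast` down. `diffuse` fills in a patch whose neighbours are strongly activated, which loosely mirrors the idea of completing activation over an occluded region.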

