Dual-branch contrastive learning for weakly supervised object localization

Impact Factor 3.4 · CAS Zone 2 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Zebin Guo, Dong Li, Zhengjun Du, Bingfeng Seng
{"title":"Dual-branch contrastive learning for weakly supervised object localization","authors":"Zebin Guo,&nbsp;Dong Li,&nbsp;Zhengjun Du,&nbsp;Bingfeng Seng","doi":"10.1007/s10489-025-06514-1","DOIUrl":null,"url":null,"abstract":"<div><p>The weakly supervised object localization task uses image-level labels to train object localization models. Traditional convolutional neural network (CNN)-based methods usually localize objects using a class activation map. However, the class activation map usually suffers from the problem of activating a small part of the object that is most discriminative. Meanwhile, the methods based on the Vision Transformer can capture long-range feature dependencies but tend to ignore local feature details. In this paper, we innovatively propose a dual-branch contrastive learning (DBC) method that consists of a Transformer and a CNN branch. The method can effectively separate the background and foreground of an image and fuse the features of Transformer and CNN through contrastive learning. Specifically, the method separates the background and foreground representations of the image using the initially generated class-agnostic activation maps. Then, the representations of the same image from different branches form positive pairs for contrastive learning. The background and foreground representations from the same branch form negative pairs. Finally, the DBC method forces the model to separate the background and foreground representations through negative contrastive loss and makes the model fuse the features of two branches through positive contrastive loss. Experiments on the ILSVRC benchmark show that the proposed method can achieve a Top-1 localization accuracy of 59.9% and a GT-known localization accuracy of 71.7%, which are better metrics than those of the state-of-the-art methods with the same parameter complexity.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 7","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06514-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Weakly supervised object localization trains object localization models from image-level labels alone. Traditional convolutional neural network (CNN)-based methods typically localize objects with a class activation map, which tends to activate only the most discriminative part of the object. Vision Transformer-based methods, in contrast, capture long-range feature dependencies but tend to ignore local feature details. In this paper, we propose a dual-branch contrastive learning (DBC) method consisting of a Transformer branch and a CNN branch. The method effectively separates the background and foreground of an image and fuses Transformer and CNN features through contrastive learning. Specifically, it separates the background and foreground representations of the image using the initially generated class-agnostic activation maps. Representations of the same image from different branches then form positive pairs for contrastive learning, while the background and foreground representations from the same branch form negative pairs. Finally, DBC forces the model to separate background from foreground representations through a negative contrastive loss and to fuse the features of the two branches through a positive contrastive loss. Experiments on the ILSVRC benchmark show that the proposed method achieves a Top-1 localization accuracy of 59.9% and a GT-known localization accuracy of 71.7%, outperforming state-of-the-art methods of the same parameter complexity.
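To make the loss structure described above concrete, the following PyTorch-style sketch shows one plausible reading of the two contrastive terms. It is not the authors' implementation: the function names (`masked_pool`, `dbc_losses`), the soft foreground mask `fg_mask` derived from the class-agnostic activation map, the cosine-similarity form, and the logistic loss are all assumptions. It pools each branch's features into foreground and background vectors, pulls cross-branch pairs of the same image together, and pushes same-branch foreground/background pairs apart.

```python
import torch
import torch.nn.functional as F


def masked_pool(feat, mask):
    """Pool a feature map into one vector weighted by a (class-agnostic) activation mask.

    feat: (B, C, H, W) features from one branch
    mask: (B, 1, H, W) soft foreground (or background) mask in [0, 1]
    """
    pooled = (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)
    return F.normalize(pooled, dim=1)


def dbc_losses(feat_cnn, feat_vit, fg_mask, temperature=0.07):
    """Hypothetical positive/negative contrastive terms matching the abstract.

    Positive term: representations of the same image from different branches
    (foreground-foreground and background-background) are pulled together.
    Negative term: foreground and background representations from the same
    branch are pushed apart.
    """
    bg_mask = 1.0 - fg_mask

    fg_cnn, bg_cnn = masked_pool(feat_cnn, fg_mask), masked_pool(feat_cnn, bg_mask)
    fg_vit, bg_vit = masked_pool(feat_vit, fg_mask), masked_pool(feat_vit, bg_mask)

    # Positive pairs: same image, different branches -> maximize cosine similarity.
    pos = ((fg_cnn * fg_vit).sum(dim=1) + (bg_cnn * bg_vit).sum(dim=1)) / temperature
    loss_pos = -F.logsigmoid(pos).mean()

    # Negative pairs: foreground vs. background within each branch -> minimize similarity.
    neg = ((fg_cnn * bg_cnn).sum(dim=1) + (fg_vit * bg_vit).sum(dim=1)) / temperature
    loss_neg = -F.logsigmoid(-neg).mean()

    return loss_pos, loss_neg
```

In this reading, `fg_mask` would be the class-agnostic activation map resized to the feature resolution, and the two losses would be summed with the classification loss during training; how the paper actually weights and combines them is not stated in the abstract.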


Source Journal

Applied Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months
Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.