Object-Level and Scene-Level Feature Aggregation with CLIP for scene recognition

IF 14.7 | CAS Zone 1 (Computer Science) | JCR Q1 | Computer Science, Artificial Intelligence
Qun Wang, Feng Zhu, Ge Wu, Pengfei Zhao, Jianyu Wang, Xiang Li
{"title":"Object-Level and Scene-Level Feature Aggregation with CLIP for scene recognition","authors":"Qun Wang ,&nbsp;Feng Zhu ,&nbsp;Ge Wu ,&nbsp;Pengfei Zhao ,&nbsp;Jianyu Wang ,&nbsp;Xiang Li","doi":"10.1016/j.inffus.2025.103118","DOIUrl":null,"url":null,"abstract":"<div><div>Scene recognition is a fundamental task in computer vision, pivotal for applications like visual navigation and robotics. However, traditional methods struggle to effectively capture and aggregate scene-related features due to the inherent complexity and diversity of scenes, often leading to sub-optimal performance. To address this limitation, we propose a novel method, named OSFA (<strong>O</strong>bject-level and <strong>S</strong>cene-level <strong>F</strong>eature <strong>A</strong>ggregation), that leverages CLIP’s multimodal strengths to enhance scene feature representation through a two-stage aggregation strategy: Object-Level Feature Aggregation (OLFA) and Scene-Level Feature Aggregation (SLFA). In OLFA, we first generate an initial scene feature by integrating the average-pooled feature map of the base visual encoder and the CLIP visual feature. The initial scene feature is then used as a query in object-level cross-attention to extract object-level details most relevant to the scene from the feature map, thereby enhancing the representation. In SLFA, we first use CLIP’s textual encoder to provide category-level textual features for the scene, guiding the aggregation of corresponding visual features from the feature map. OLFA’s enhanced scene feature then queries these category-aware features using scene-level cross-attention to further capture scene-level information and obtain the final scene representation. To strengthen training, we employ a multi-loss strategy inspired by contrastive learning, improving feature robustness and discriminative ability. We evaluate OSFA on three challenging datasets (i.e. Places365, MIT67, and SUN397), achieving substantial improvements in classification accuracy. These results highlight the effectiveness of our method in enhancing scene feature representation through CLIP-guided aggregation. This advancement significantly improves scene recognition performance. Our code is public at <span><span>https://github.com/WangqunQAQ/OSFA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"120 ","pages":"Article 103118"},"PeriodicalIF":14.7000,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525001915","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Scene recognition is a fundamental task in computer vision, pivotal for applications like visual navigation and robotics. However, traditional methods struggle to effectively capture and aggregate scene-related features due to the inherent complexity and diversity of scenes, often leading to sub-optimal performance. To address this limitation, we propose a novel method, named OSFA (Object-level and Scene-level Feature Aggregation), that leverages CLIP’s multimodal strengths to enhance scene feature representation through a two-stage aggregation strategy: Object-Level Feature Aggregation (OLFA) and Scene-Level Feature Aggregation (SLFA). In OLFA, we first generate an initial scene feature by integrating the average-pooled feature map of the base visual encoder and the CLIP visual feature. The initial scene feature is then used as a query in object-level cross-attention to extract object-level details most relevant to the scene from the feature map, thereby enhancing the representation. In SLFA, we first use CLIP’s textual encoder to provide category-level textual features for the scene, guiding the aggregation of corresponding visual features from the feature map. OLFA’s enhanced scene feature then queries these category-aware features using scene-level cross-attention to further capture scene-level information and obtain the final scene representation. To strengthen training, we employ a multi-loss strategy inspired by contrastive learning, improving feature robustness and discriminative ability. We evaluate OSFA on three challenging datasets (i.e. Places365, MIT67, and SUN397), achieving substantial improvements in classification accuracy. These results highlight the effectiveness of our method in enhancing scene feature representation through CLIP-guided aggregation. This advancement significantly improves scene recognition performance. Our code is public at https://github.com/WangqunQAQ/OSFA.
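The two-stage aggregation described above lends itself to a compact attention-based formulation. The following PyTorch sketch illustrates the data flow only; the module names, the concatenate-and-project fusion of the pooled feature map with the CLIP visual feature, and all dimensions are assumptions made for illustration, not the authors' implementation (which is available at https://github.com/WangqunQAQ/OSFA). The multi-loss contrastive training strategy is omitted.

```python
# Minimal sketch of the OLFA + SLFA data flow described in the abstract.
# All names, fusion choices, and shapes are assumptions for illustration.
import torch
import torch.nn as nn


class OSFASketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_classes=365):
        super().__init__()
        # OLFA: the initial scene feature queries the base feature map.
        self.olfa_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # SLFA step 1: CLIP text features guide category-aware aggregation.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # SLFA step 2: the enhanced scene feature queries the category-aware features.
        self.slfa_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed fusion: concatenate pooled map + CLIP visual feature, then project.
        self.fuse = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feat_map, clip_visual, clip_text):
        # feat_map:    (B, N, D) tokens from the base visual encoder
        # clip_visual: (B, D)    CLIP image feature
        # clip_text:   (K, D)    CLIP text features, one per scene category
        B = feat_map.size(0)

        # Initial scene feature: average-pooled feature map fused with CLIP visual feature.
        init_scene = self.fuse(torch.cat([feat_map.mean(dim=1), clip_visual], dim=-1))

        # OLFA: extract object-level details most relevant to the scene.
        scene, _ = self.olfa_attn(init_scene.unsqueeze(1), feat_map, feat_map)  # (B, 1, D)

        # SLFA: text features aggregate category-aware visual features ...
        text = clip_text.unsqueeze(0).expand(B, -1, -1)           # (B, K, D)
        cat_feats, _ = self.text_attn(text, feat_map, feat_map)   # (B, K, D)
        # ... which the enhanced scene feature then queries for the final representation.
        final, _ = self.slfa_attn(scene, cat_feats, cat_feats)    # (B, 1, D)
        return self.classifier(final.squeeze(1))


# Usage with dummy tensors (ViT-B/16-like shapes are an assumption):
model = OSFASketch()
logits = model(torch.randn(2, 196, 512), torch.randn(2, 512), torch.randn(365, 512))
print(logits.shape)  # torch.Size([2, 365])
```

In this reading, OLFA reduces to a single cross-attention call with the fused scene feature as the lone query over the feature-map tokens, and SLFA chains two more: text-guided aggregation of category-aware features, followed by the scene feature querying those features to produce the final representation fed to the classifier.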
Source Journal

Information Fusion (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months

Aims and scope: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.