Qun Wang, Feng Zhu, Ge Wu, Pengfei Zhao, Jianyu Wang, Xiang Li
{"title":"使用CLIP进行场景识别的对象级和场景级特征聚合","authors":"Qun Wang , Feng Zhu , Ge Wu , Pengfei Zhao , Jianyu Wang , Xiang Li","doi":"10.1016/j.inffus.2025.103118","DOIUrl":null,"url":null,"abstract":"<div><div>Scene recognition is a fundamental task in computer vision, pivotal for applications like visual navigation and robotics. However, traditional methods struggle to effectively capture and aggregate scene-related features due to the inherent complexity and diversity of scenes, often leading to sub-optimal performance. To address this limitation, we propose a novel method, named OSFA (<strong>O</strong>bject-level and <strong>S</strong>cene-level <strong>F</strong>eature <strong>A</strong>ggregation), that leverages CLIP’s multimodal strengths to enhance scene feature representation through a two-stage aggregation strategy: Object-Level Feature Aggregation (OLFA) and Scene-Level Feature Aggregation (SLFA). In OLFA, we first generate an initial scene feature by integrating the average-pooled feature map of the base visual encoder and the CLIP visual feature. The initial scene feature is then used as a query in object-level cross-attention to extract object-level details most relevant to the scene from the feature map, thereby enhancing the representation. In SLFA, we first use CLIP’s textual encoder to provide category-level textual features for the scene, guiding the aggregation of corresponding visual features from the feature map. OLFA’s enhanced scene feature then queries these category-aware features using scene-level cross-attention to further capture scene-level information and obtain the final scene representation. To strengthen training, we employ a multi-loss strategy inspired by contrastive learning, improving feature robustness and discriminative ability. We evaluate OSFA on three challenging datasets (i.e. Places365, MIT67, and SUN397), achieving substantial improvements in classification accuracy. These results highlight the effectiveness of our method in enhancing scene feature representation through CLIP-guided aggregation. This advancement significantly improves scene recognition performance. Our code is public at <span><span>https://github.com/WangqunQAQ/OSFA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"120 ","pages":"Article 103118"},"PeriodicalIF":14.7000,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Object-Level and Scene-Level Feature Aggregation with CLIP for scene recognition\",\"authors\":\"Qun Wang , Feng Zhu , Ge Wu , Pengfei Zhao , Jianyu Wang , Xiang Li\",\"doi\":\"10.1016/j.inffus.2025.103118\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Scene recognition is a fundamental task in computer vision, pivotal for applications like visual navigation and robotics. However, traditional methods struggle to effectively capture and aggregate scene-related features due to the inherent complexity and diversity of scenes, often leading to sub-optimal performance. To address this limitation, we propose a novel method, named OSFA (<strong>O</strong>bject-level and <strong>S</strong>cene-level <strong>F</strong>eature <strong>A</strong>ggregation), that leverages CLIP’s multimodal strengths to enhance scene feature representation through a two-stage aggregation strategy: Object-Level Feature Aggregation (OLFA) and Scene-Level Feature Aggregation (SLFA). 
In OLFA, we first generate an initial scene feature by integrating the average-pooled feature map of the base visual encoder and the CLIP visual feature. The initial scene feature is then used as a query in object-level cross-attention to extract object-level details most relevant to the scene from the feature map, thereby enhancing the representation. In SLFA, we first use CLIP’s textual encoder to provide category-level textual features for the scene, guiding the aggregation of corresponding visual features from the feature map. OLFA’s enhanced scene feature then queries these category-aware features using scene-level cross-attention to further capture scene-level information and obtain the final scene representation. To strengthen training, we employ a multi-loss strategy inspired by contrastive learning, improving feature robustness and discriminative ability. We evaluate OSFA on three challenging datasets (i.e. Places365, MIT67, and SUN397), achieving substantial improvements in classification accuracy. These results highlight the effectiveness of our method in enhancing scene feature representation through CLIP-guided aggregation. This advancement significantly improves scene recognition performance. Our code is public at <span><span>https://github.com/WangqunQAQ/OSFA</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"120 \",\"pages\":\"Article 103118\"},\"PeriodicalIF\":14.7000,\"publicationDate\":\"2025-03-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525001915\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525001915","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Object-Level and Scene-Level Feature Aggregation with CLIP for scene recognition
Scene recognition is a fundamental task in computer vision, pivotal for applications like visual navigation and robotics. However, traditional methods struggle to effectively capture and aggregate scene-related features due to the inherent complexity and diversity of scenes, often leading to sub-optimal performance. To address this limitation, we propose a novel method, named OSFA (Object-level and Scene-level Feature Aggregation), that leverages CLIP’s multimodal strengths to enhance scene feature representation through a two-stage aggregation strategy: Object-Level Feature Aggregation (OLFA) and Scene-Level Feature Aggregation (SLFA). In OLFA, we first generate an initial scene feature by integrating the average-pooled feature map of the base visual encoder with the CLIP visual feature. The initial scene feature is then used as the query in object-level cross-attention to extract the object-level details most relevant to the scene from the feature map, thereby enhancing the representation. In SLFA, we first use CLIP’s textual encoder to provide category-level textual features for the scene, which guide the aggregation of the corresponding visual features from the feature map. OLFA’s enhanced scene feature then queries these category-aware features through scene-level cross-attention to further capture scene-level information and produce the final scene representation. To strengthen training, we employ a multi-loss strategy inspired by contrastive learning, improving feature robustness and discriminative ability. We evaluate OSFA on three challenging datasets (Places365, MIT67, and SUN397) and achieve substantial improvements in classification accuracy. These results highlight the effectiveness of our method in enhancing scene feature representation through CLIP-guided aggregation, which in turn significantly improves scene recognition performance. Our code is publicly available at https://github.com/WangqunQAQ/OSFA.
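The abstract describes a two-stage cross-attention pipeline. The short PyTorch sketch below illustrates one way the OLFA and SLFA stages could be wired together; all class and variable names, tensor shapes, the simple additive fusion, and the use of nn.MultiheadAttention are illustrative assumptions rather than the authors' implementation, which is available at the linked repository.

# Hypothetical sketch of the two-stage aggregation outlined in the abstract.
# Shapes, module names, and fusion choices are assumptions for illustration only.
import torch
import torch.nn as nn


class TwoStageAggregation(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # OLFA: the scene query attends over the spatial feature map
        # of the base visual encoder.
        self.object_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # SLFA: the enhanced scene feature attends over category-aware
        # features obtained via text guidance.
        self.scene_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feature_map, clip_visual, clip_text):
        # feature_map: (B, N, D) spatial tokens from the base visual encoder
        # clip_visual: (B, D)    CLIP image feature
        # clip_text:   (C, D)    CLIP textual features, one per scene category
        B = feature_map.size(0)

        # Initial scene feature: fuse the average-pooled feature map with the
        # CLIP visual feature (a simple sum here; the paper's fusion may differ).
        initial = feature_map.mean(dim=1) + clip_visual           # (B, D)

        # OLFA: query the object-level details most relevant to the scene.
        q = initial.unsqueeze(1)                                  # (B, 1, D)
        object_ctx, _ = self.object_attn(q, feature_map, feature_map)
        enhanced = initial + object_ctx.squeeze(1)                # (B, D)

        # Text-guided aggregation: each category's textual feature collects its
        # corresponding visual evidence from the feature map.
        text_q = clip_text.unsqueeze(0).expand(B, -1, -1)         # (B, C, D)
        sim = torch.softmax(text_q @ feature_map.transpose(1, 2), dim=-1)
        category_feats = sim @ feature_map                        # (B, C, D)

        # SLFA: the enhanced scene feature queries the category-aware features.
        scene_ctx, _ = self.scene_attn(enhanced.unsqueeze(1),
                                       category_feats, category_feats)
        return enhanced + scene_ctx.squeeze(1)                    # final scene feature


# Example usage with random tensors (shapes are illustrative only).
if __name__ == "__main__":
    model = TwoStageAggregation(dim=512)
    fmap = torch.randn(2, 49, 512)   # e.g. a flattened 7x7 spatial grid
    img = torch.randn(2, 512)
    txt = torch.randn(365, 512)      # e.g. one textual feature per Places365 class
    out = model(fmap, img, txt)
    print(out.shape)                 # torch.Size([2, 512])

The multi-loss training strategy mentioned in the abstract is not shown here; the sketch covers only the feature-aggregation path.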
Journal description:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.