Semantic Concept Perception Network With Interactive Prompting for Cross-View Image Geo-Localization

IF 11.1, Zone 1 (Engineering Technology), JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-24 DOI:10.1109/TCSVT.2025.3533574

Yuan Gao;Haibo Liu;Xiaohui Wei

Vol. 35, No. 6, pp. 5343-5354. https://ieeexplore.ieee.org/document/10852334/


Abstract

Cross-view image geo-localization aims to estimate the geographic position of a query image from a ground platform (such as a mobile phone or vehicle camera) by matching it with geo-tagged reference images from an aerial platform (such as a drone or satellite). Although existing studies have achieved promising results, they usually rely only on deep features and fail to effectively handle the severe changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core is to extract and integrate semantic concept information reflecting the spatial position relationships between objects. Specifically, for a given pair of input images, a CNN stem with positional embedding is first adopted to extract deep features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby extracting semantic concept information. Furthermore, to obtain global descriptions of the different views, a bidirectional feature injection fusion module based on an attention mechanism is proposed to exploit the long-range dependencies between semantic concept features and deep features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results show that the proposed method outperforms state-of-the-art methods on mainstream cross-view datasets.
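The abstract names a triplet loss with hard sample mining but does not spell out the "flexible" mining strategy. As a rough illustration only, the sketch below uses generic batch-hard mining as a stand-in, not SCPNet's actual strategy. It assumes `ground[i]` and `aerial[i]` are global descriptors of the same location; the function names and the `margin` value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a descriptor to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def batch_hard_triplet_loss(ground, aerial, margin=0.3):
    """Symmetric triplet loss using the hardest in-batch negative.

    Assumption: ground[i] and aerial[i] describe the same location, and
    every other pairing in the batch is a negative. This is generic
    batch-hard mining, used here in place of the paper's unspecified
    flexible strategy.
    """
    if len(ground) != len(aerial) or len(ground) < 2:
        raise ValueError("need two equally sized batches of at least 2 items")
    g = [l2_normalize(v) for v in ground]
    a = [l2_normalize(v) for v in aerial]
    n = len(g)
    total = 0.0
    for i in range(n):
        pos = sq_dist(g[i], a[i])
        # Hardest negative = closest non-matching descriptor in the batch.
        hardest_g2a = min(sq_dist(g[i], a[j]) for j in range(n) if j != i)
        hardest_a2g = min(sq_dist(g[j], a[i]) for j in range(n) if j != i)
        total += max(0.0, margin + pos - hardest_g2a)  # ground -> aerial
        total += max(0.0, margin + pos - hardest_a2g)  # aerial -> ground
    return total / (2 * n)


if __name__ == "__main__":
    matched = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(batch_hard_triplet_loss(matched, matched))
```

The loss is zero once every matching pair is closer than its hardest negative by at least the margin, so only the confusable pairs in a batch contribute to the result.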


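Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months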

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.