{"title":"Semantic Concept Perception Network With Interactive Prompting for Cross-View Image Geo-Localization","authors":"Yuan Gao;Haibo Liu;Xiaohui Wei","doi":"10.1109/TCSVT.2025.3533574","DOIUrl":null,"url":null,"abstract":"Cross-view image geo-localization aims to estimate the geographic position of a query image from the ground platform (such as mobile phone, vehicle camera) by matching it with geo-tagged reference images from the aerial platform (such as drone, satellite). Although existing studies have achieved promising results, they usually rely only on depth features and fail to effectively handle the serious changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core is to extract and integrate semantic concept information reflecting spatial position relationship between objects. Specifically, for a given of pair input images, a CNN stem with positional embedding is first adopted to extract depth features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby achieving the purpose of extracting semantic concept information. Furthermore, to obtain global descriptions of different views, a feature bidirectional injection fusion module based on attention mechanism is proposed to exploit the long-range dependencies of semantic concept and depth features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results have shown that our proposed method can achieve better performance compared with state-of-the-art methods on mainstream cross-view datasets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5343-5354"},"PeriodicalIF":11.1000,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10852334/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Cross-view image geo-localization aims to estimate the geographic position of a query image captured by a ground platform (such as a mobile phone or vehicle camera) by matching it with geo-tagged reference images captured by an aerial platform (such as a drone or satellite). Although existing studies have achieved promising results, they usually rely only on deep features and fail to effectively handle the severe changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core idea is to extract and integrate semantic concept information that reflects the spatial positional relationships between objects. Specifically, for a given pair of input images, a CNN stem with positional embedding is first adopted to extract deep features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby extracting semantic concept information. Furthermore, to obtain global descriptions of the different views, a feature bidirectional injection fusion module based on an attention mechanism is proposed to exploit the long-range dependencies between semantic concept features and deep features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results show that the proposed method achieves better performance than state-of-the-art methods on mainstream cross-view datasets.
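To make the training objective concrete, below is a minimal PyTorch sketch of a triplet loss with hard sample mining. The abstract does not specify the paper's "flexible" mining strategy, margin value, or distance metric, so this sketch assumes standard batch-hard mining (hardest positive and hardest negative per anchor), Euclidean distance, and a margin of 0.3; all of these are stand-in assumptions, not the authors' exact formulation.

```python
# A hedged sketch of a batch-hard triplet loss, assuming standard batch-hard
# mining as a stand-in for the paper's unspecified "flexible" strategy.
import torch
import torch.nn.functional as F


def batch_hard_triplet_loss(embeddings: torch.Tensor,
                            labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """embeddings: (N, D) image descriptors; labels: (N,) scene IDs."""
    # Pairwise Euclidean distances between all descriptors in the batch.
    dist = torch.cdist(embeddings, embeddings, p=2)  # (N, N)

    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-scene mask
    pos_mask = same & ~torch.eye(len(labels), dtype=torch.bool,
                                 device=labels.device)
    neg_mask = ~same

    # Hardest positive: the farthest descriptor sharing the anchor's label.
    hardest_pos = (dist * pos_mask).max(dim=1).values
    # Hardest negative: the closest descriptor with a different label.
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values

    # Hinge: positives should sit at least `margin` closer than negatives.
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```

In the cross-view setting, a ground descriptor and the aerial descriptors of the same scene would share a label, so the hinge pulls matched views together while pushing the closest non-matching scene beyond the margin.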
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.