{"title":"AirGeoNet: A Map-Guided Visual Geo-Localization Approach for Aerial Vehicles","authors":"Xiangze Meng;Wulong Guo;Kai Zhou;Ting Sun;Lei Deng;Shijie Yu;Yuhao Feng","doi":"10.1109/TGRS.2024.3456912","DOIUrl":null,"url":null,"abstract":"Aerial vehicles (AVs) commonly operate in vast environments, presenting a persistent challenge in achieving high-precision localization. The contemporary popular global positioning methods have their inherent limitations. For instance, the precision of GPS is susceptible to decline or even complete failure when the signal is disrupted or absent. Furthermore, the precision of image retrieval techniques is inadequate. The construction of 3-D models is a time-consuming and storage-intensive endeavor. In addition, scene coordinate regression necessitates retraining to adapt to varying scenarios, which presents challenges when attempting to generalize across expansive environments. Addressing these challenges, we propose a network named AirGeoNet, which integrates satellite images and semantic maps to achieve high-precision efficient localization. In the first phase, we introduce the foundation model DINOV2 to extract features from satellite and aerial images, employ a vector of locally aggregated descriptor (VLAD) for image retrieval to get coarse position, and, finally, significantly enhance retrieval accuracy by combining sequential images with particle filters. Subsequently, AirGeoNet matches aerial images with semantic maps to determine the three degrees of freedom in pose, including position and orientation. The semantic maps utilized by AirGeoNet are sourced from OpenStreetMap and our self-produced QMap, and training is conducted in a supervised manner using real camera poses. 
Our AirGeoNet method is highly efficient, requiring only a 1546-D feature vector per image for image retrieval and 240k storage for a 0.9-\n<inline-formula> <tex-math>$\\text {km}^{2}$ </tex-math></inline-formula>\n semantic map while achieving state-of-the-art accuracy with single-frame localization errors of 2.854 m on semantically rich datasets and 11 m in complex scenarios. Our code is publicly available at \n<uri>https://github.com/mxz520mxz/AirGeoNet.git</uri>","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":null,"pages":null},"PeriodicalIF":7.5000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10671570/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citation count: 0
Abstract
Aerial vehicles (AVs) commonly operate in vast environments, making high-precision localization a persistent challenge. Popular global positioning methods have inherent limitations: GPS precision degrades or fails entirely when the signal is disrupted or absent; image retrieval techniques lack sufficient precision; constructing 3-D models is time-consuming and storage-intensive; and scene coordinate regression requires retraining to adapt to new scenarios, which hinders generalization across expansive environments. To address these challenges, we propose AirGeoNet, a network that integrates satellite images and semantic maps to achieve efficient, high-precision localization. In the first phase, we introduce the foundation model DINOv2 to extract features from satellite and aerial images, employ a vector of locally aggregated descriptors (VLAD) for image retrieval to obtain a coarse position, and then significantly improve retrieval accuracy by combining sequential images with a particle filter. Subsequently, AirGeoNet matches aerial images against semantic maps to estimate the three degrees of freedom of the pose: position and orientation. The semantic maps used by AirGeoNet are sourced from OpenStreetMap and our self-produced QMap, and training is supervised with real camera poses. AirGeoNet is highly efficient, requiring only a 1546-D feature vector per image for retrieval and 240k of storage for a 0.9-km² semantic map, while achieving state-of-the-art accuracy with single-frame localization errors of 2.854 m on semantically rich datasets and 11 m in complex scenarios. Our code is publicly available at
https://github.com/mxz520mxz/AirGeoNet.git
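The abstract's retrieval stage aggregates local backbone features into a VLAD descriptor before matching aerial queries against satellite references. The sketch below illustrates the generic VLAD aggregation step only, not the authors' implementation: the feature dimensions, number of visual words, and normalization scheme are all illustrative assumptions (in practice the centroids would be learned offline, e.g. by k-means over DINOv2 patch features).

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate local descriptors into a VLAD vector.

    descriptors: (N, D) local features for one image.
    centers:     (K, D) visual-word centroids learned offline.
    Returns a (K * D,) L2-normalized VLAD descriptor.
    """
    # Hard-assign each descriptor to its nearest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            # Accumulate residuals between descriptors and their centroid.
            v[k] = (members - centers[k]).sum(axis=0)
    v = v.ravel()
    # Global L2 normalization keeps the sketch short; intra-normalization
    # per visual word is a common refinement.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Retrieval then reduces to nearest-neighbor search over these fixed-length vectors, which is what makes a single compact descriptor per image (1546-D in the paper) sufficient.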
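The abstract also states that retrieval accuracy is sharpened by fusing sequential frames with a particle filter. As a rough, generic illustration of that idea (not the paper's algorithm; the motion model, noise scale, similarity function, and resampling rule here are all assumptions), one predict/update/resample step over 2-D position hypotheses might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, motion, similarity_fn, noise=2.0):
    """One step of a particle filter over 2-D position hypotheses.

    particles:     (N, 2) candidate positions in map coordinates.
    weights:       (N,) current particle weights (sum to 1).
    motion:        (2,) odometry displacement since the last frame.
    similarity_fn: maps a position to a retrieval similarity score.
    """
    # Predict: propagate particles by odometry plus Gaussian noise.
    particles = particles + motion + rng.normal(0.0, noise, particles.shape)
    # Update: reweight each hypothesis by how well the current frame's
    # retrieval score supports it.
    scores = np.array([similarity_fn(p) for p in particles])
    weights = weights * (scores + 1e-12)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / (weights ** 2).sum() < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```

The position estimate is the weighted mean of the particles; over successive frames, hypotheses inconsistent with the motion and the retrieval scores die out, which is how sequential fusion can outperform single-frame retrieval.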
Journal overview:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.