An efficient cross-view image fusion method based on selected state space and hashing for promoting urban perception

IF 14.7 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Peng Han, Chao Chen
{"title":"基于选定状态空间和散列的高效跨视角图像融合方法促进城市感知","authors":"Peng Han ,&nbsp;Chao Chen","doi":"10.1016/j.inffus.2024.102737","DOIUrl":null,"url":null,"abstract":"<div><div>In the field of cross-view image geolocation, traditional convolutional neural network (CNN)-based learning models generate unsatisfactory fusion performance due to their inability to model global correlations. The Transformer-based fusion methods can well compensate for the above problems, however, the Transformer has quadratic computational complexity and huge GPU memory consumption. The recent Mamba model based on the selection state space has a strong ability to model long sequences, lower GPU memory occupancy, and fewer GFLOPs. It is thus attractive and worth studying to apply Mamba to the cross-view image geolocation task. In addition, in the image-matching process (i.e., fusion of satellite/aerial and street view data.), we found that the storage occupancy of similarity measures based on floating-point features is high. Efficiently converting floating-point features into hash codes is a possible solution. In this study, we propose a cross-view image geolocation method (S6HG) based purely on Vision Mamba and hashing. S6HG fully utilizes the advantages of Vision Mamba in global information modeling and explicit location information encoding and the low storage occupancy of hash codes. Our method consists of two stages. In the first stage, we use a Siamese network based purely on vision Mamba to embed features for street view images and satellite images respectively. Our first-stage model is called S6G. In the second stage, we construct a cross-view autoencoder to further refine and compress the embedded features, and then simply map the refined features to hash codes. Comprehensive experiments show that S6G has achieved superior results on the CVACT dataset and comparable results to the most advanced methods on the CVUSA dataset. It is worth noting that other floating-point feature-based methods (4096-dimension) are 170.59 times faster than S6HG (768-bit) in storing 90,618 retrieval gallery data. Furthermore, the inference efficiency of S6G is higher than ViT-based computational methods.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"115 ","pages":"Article 102737"},"PeriodicalIF":14.7000,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An efficient cross-view image fusion method based on selected state space and hashing for promoting urban perception\",\"authors\":\"Peng Han ,&nbsp;Chao Chen\",\"doi\":\"10.1016/j.inffus.2024.102737\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In the field of cross-view image geolocation, traditional convolutional neural network (CNN)-based learning models generate unsatisfactory fusion performance due to their inability to model global correlations. The Transformer-based fusion methods can well compensate for the above problems, however, the Transformer has quadratic computational complexity and huge GPU memory consumption. The recent Mamba model based on the selection state space has a strong ability to model long sequences, lower GPU memory occupancy, and fewer GFLOPs. It is thus attractive and worth studying to apply Mamba to the cross-view image geolocation task. 
In addition, in the image-matching process (i.e., fusion of satellite/aerial and street view data.), we found that the storage occupancy of similarity measures based on floating-point features is high. Efficiently converting floating-point features into hash codes is a possible solution. In this study, we propose a cross-view image geolocation method (S6HG) based purely on Vision Mamba and hashing. S6HG fully utilizes the advantages of Vision Mamba in global information modeling and explicit location information encoding and the low storage occupancy of hash codes. Our method consists of two stages. In the first stage, we use a Siamese network based purely on vision Mamba to embed features for street view images and satellite images respectively. Our first-stage model is called S6G. In the second stage, we construct a cross-view autoencoder to further refine and compress the embedded features, and then simply map the refined features to hash codes. Comprehensive experiments show that S6G has achieved superior results on the CVACT dataset and comparable results to the most advanced methods on the CVUSA dataset. It is worth noting that other floating-point feature-based methods (4096-dimension) are 170.59 times faster than S6HG (768-bit) in storing 90,618 retrieval gallery data. Furthermore, the inference efficiency of S6G is higher than ViT-based computational methods.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"115 \",\"pages\":\"Article 102737\"},\"PeriodicalIF\":14.7000,\"publicationDate\":\"2024-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253524005153\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253524005153","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In the field of cross-view image geolocation, traditional convolutional neural network (CNN)-based learning models yield unsatisfactory fusion performance because they cannot model global correlations. Transformer-based fusion methods compensate well for this limitation; however, the Transformer has quadratic computational complexity and high GPU memory consumption. The recent Mamba model, based on the selective state space, offers strong long-sequence modeling ability, lower GPU memory occupancy, and fewer GFLOPs, so applying Mamba to the cross-view image geolocation task is attractive and worth studying. In addition, in the image-matching process (i.e., the fusion of satellite/aerial and street-view data), we found that similarity measurement based on floating-point features incurs high storage occupancy; efficiently converting floating-point features into hash codes is a possible solution. In this study, we propose a cross-view image geolocation method (S6HG) based purely on Vision Mamba and hashing. S6HG fully exploits Vision Mamba's advantages in global information modeling and explicit location-information encoding, together with the low storage occupancy of hash codes. Our method consists of two stages. In the first stage, we use a Siamese network based purely on Vision Mamba to embed features for street-view images and satellite images, respectively; this first-stage model is called S6G. In the second stage, we construct a cross-view autoencoder to further refine and compress the embedded features, and then simply map the refined features to hash codes. Comprehensive experiments show that S6G achieves superior results on the CVACT dataset and results comparable to the most advanced methods on the CVUSA dataset. Notably, for the 90,618-entry retrieval gallery, the 4096-dimensional floating-point features used by other methods require 170.59 times the storage of S6HG's 768-bit hash codes. Furthermore, the inference efficiency of S6G is higher than that of ViT-based methods.
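The abstract says the refined features are "simply mapped" to hash codes but does not specify the mapping. Below is a minimal sketch of this second-stage step, assuming sign binarization (a common choice for such mappings) followed by Hamming-distance ranking over the gallery; the function names and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def to_hash_codes(features: np.ndarray) -> np.ndarray:
    """Binarize real-valued features into {0, 1} hash codes via the sign.

    Assumed mapping: the paper only states that refined features are
    'simply mapped' to hash codes.
    """
    return (features > 0).astype(np.uint8)

def hamming_distances(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Hamming distance between one query code and every gallery code."""
    return np.count_nonzero(query[None, :] != gallery, axis=1)

# Toy usage: 768-bit codes, as in S6HG, over a 5-item satellite gallery.
rng = np.random.default_rng(0)
street_feat = rng.standard_normal(768)         # refined street-view feature
gallery_feats = rng.standard_normal((5, 768))  # refined satellite features

query_code = to_hash_codes(street_feat)
gallery_codes = to_hash_codes(gallery_feats)
best_match = int(np.argmin(hamming_distances(query_code, gallery_codes)))
print(best_match)  # index of the nearest gallery entry in Hamming space
```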
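The 170.59x storage figure can be sanity-checked with back-of-the-envelope arithmetic, assuming the baseline's 4096-dimensional features are stored as 32-bit floats (the abstract does not state the precision):

```python
# Storage for the 90,618-item retrieval gallery.
gallery_size = 90_618

float_bytes = gallery_size * 4096 * 4  # 4096-dim float32 features, 4 bytes each
hash_bytes = gallery_size * 768 // 8   # 768-bit hash codes, packed 8 bits per byte

print(float_bytes / hash_bytes)        # ~170.67, close to the reported 170.59x
```

The small gap between 170.67 and the reported 170.59 presumably comes from how the features are serialized in practice.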
Source journal
Information Fusion (Engineering & Technology – Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Annual publications: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.