Boosting cross-modal retrieval in remote sensing via a novel unified attention network

Impact factor: 6.0 · Zone 1, Computer Science · JCR Q1, Computer Science, Artificial Intelligence

Neural Networks · Pub Date: 2024-09-11 · DOI: 10.1016/j.neunet.2024.106718

Article page: https://www.sciencedirect.com/science/article/pii/S0893608024006427

Citations: 0

Abstract



As remote sensing data in different modalities becomes rapidly more abundant, cross-modal retrieval has gained importance in the research community. In cross-modal retrieval, the query is of one modality and the retrieved output is of another. In this paper, the remote sensing (RS) modalities considered are earth observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge in cross-modal retrieval between optical RS images and their sketches is the distribution gap between the two modalities in the shared embedding space. Prior attempts to resolve this issue have not retrieved cross-modal sketch-image RS data with satisfactory accuracy. State-of-the-art approaches rely on conventional convolutional architectures, which focus on local pixel-wise information about the modalities to be retrieved. This limits the interaction between the sketch texture and the corresponding image and makes these models susceptible to overfitting to datasets with particular scenarios. To circumvent this limitation, we propose SPCA-Net, a novel architecture that combines self-attention and cross-attention to establish multi-modal correspondence and minimize the modality gap by applying attention mechanisms to the query and the other modality. The proposed attention architecture achieves efficient cross-modal retrieval: it empirically emphasizes the global information of the relevant query modality and bridges the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a unique loss function, the label-specific supervised contrastive loss, tailored to the intricacies of the task and designed to enhance the discriminative power of the learned embeddings. Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, the proposed model outperforms state-of-the-art architectures by significant margins of 16.7%, 18.9%, 33.7%, and 40.9% on the corresponding performance metrics.
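The abstract names two ingredients, pairwise cross-attention between the sketch and image modalities and a supervised contrastive loss, but does not give their formulas. The sketch below is therefore only an illustration under stated assumptions: it uses standard scaled dot-product attention and the generic supervised contrastive loss as stand-ins, not SPCA-Net's actual layers or the paper's label-specific variant. The function names, token values, and temperature are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Standard scaled dot-product attention (a stand-in, not SPCA-Net itself).

    Each query token from one modality attends over all tokens of the other.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not the paper's).

    Embeddings sharing a label are pulled together; all others are pushed apart.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical toy tokens: 2 sketch tokens and 3 image tokens, dimension 2.
sketch_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

# Pairwise cross-attention: each modality queries the other.
sketch_attended = cross_attention(sketch_tokens, image_tokens, image_tokens)
image_attended = cross_attention(image_tokens, sketch_tokens, sketch_tokens)

# Toy embeddings from both modalities with scene-class labels.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = supervised_contrastive_loss(embeddings, labels=[0, 0, 1, 1])
```

In this toy setting the loss is small when same-label embeddings already point in similar directions and grows when the labels are paired with dissimilar embeddings, which is the behavior a contrastive objective is meant to reward.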


Source journal

Neural Networks (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.90
Self-citation rate: 7.70%
Articles published: 425
Review time: 67 days

Journal description: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. The journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, it aims to encourage the development of biologically inspired artificial intelligence.