{"title":"Prompt-Based Granularity-Unified Representation Network for Remote Sensing Image-Text Matching","authors":"Minhan Hu;Keke Yang;Jing Li","doi":"10.1109/JSTARS.2025.3555639","DOIUrl":null,"url":null,"abstract":"Remote sensing (RS) image–text matching has gained significant attention for its promising potential. Despite great advancements, accurately matching RS images (RSIs) and captions remains challenging due to the significant multimodal gap and inherent characteristics of RS data. Many approaches use complex models to extract global features to handle semantic redundancy and varying scales in RSIs, but losing important details in RSIs and captions. While some methods align between fine-grained local features, but overlooking the semantic granularity differences between fine-grained features. Fine-grained features in RSIs typically capture only a small fraction of the overall semantics, whereas those in captions convey more comprehensive and abstract semantics. Therefore, we propose the prompt-based granularity-unified representation network, an end-to-end framework designed to mitigate the multimodal semantic granularity difference and achieve comprehensive alignment. Our approach includes two key modules: 1) the prompt-based feature aggregator, which dynamically aggregates fine-grained features into several granularity-unified tokens with fully semantic, and 2) the text-guided vision modulation, which further enhances visual representations by modulating the visual features with RS captions as language typically contains more precise semantic than visual data. Furthermore, to address the challenges posed by high similarity in RS datasets, we introduce an effective hybrid cross-modal loss that facilitates comprehensive multimodal feature alignment within a unified structure. We conduct extensive experiments on three benchmark datasets, achieving state-of-the-art performance, which validates the effectiveness and superiority of our method.","PeriodicalId":13116,"journal":{"name":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","volume":"18 ","pages":"10172-10185"},"PeriodicalIF":4.7000,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10945411","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10945411/","RegionNum":2,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Cited by: 0
Abstract
Remote sensing (RS) image–text matching has gained significant attention for its promising potential. Despite great advances, accurately matching RS images (RSIs) and captions remains challenging due to the significant multimodal gap and the inherent characteristics of RS data. Many approaches use complex models to extract global features that handle the semantic redundancy and varying scales in RSIs, but in doing so they lose important details in both RSIs and captions. Other methods align fine-grained local features directly but overlook the semantic granularity differences between those features: fine-grained features in RSIs typically capture only a small fraction of the overall semantics, whereas those in captions convey more comprehensive and abstract semantics. We therefore propose the prompt-based granularity-unified representation network, an end-to-end framework designed to mitigate this multimodal semantic granularity difference and achieve comprehensive alignment. Our approach includes two key modules: 1) a prompt-based feature aggregator, which dynamically aggregates fine-grained features into several semantically complete, granularity-unified tokens, and 2) a text-guided vision modulation, which further enhances visual representations by modulating the visual features with RS captions, since language typically conveys more precise semantics than visual data. Furthermore, to address the challenges posed by the high similarity among samples in RS datasets, we introduce an effective hybrid cross-modal loss that facilitates comprehensive multimodal feature alignment within a unified structure. Extensive experiments on three benchmark datasets show state-of-the-art performance, validating the effectiveness and superiority of our method.
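The abstract does not give implementation details, but its two central ideas lend themselves to short sketches. Below is a minimal PyTorch sketch of what a prompt-based feature aggregator could look like: a small set of learnable prompt tokens cross-attends to the fine-grained features and pools them into granularity-unified tokens. All module names, dimensions, and design choices here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a prompt-based feature aggregator: learnable prompts
# act as queries that pool fine-grained features into a few unified tokens.
# Names and hyperparameters are assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class PromptFeatureAggregator(nn.Module):
    def __init__(self, dim=512, num_prompts=4, num_heads=8):
        super().__init__()
        # Learnable prompt tokens, shared across the batch.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine_grained):  # fine_grained: (B, N, dim) local features
        b = fine_grained.size(0)
        q = self.prompts.unsqueeze(0).expand(b, -1, -1)     # (B, P, dim)
        # Each prompt aggregates the local features it attends to,
        # yielding a small set of granularity-unified tokens.
        unified, _ = self.attn(q, fine_grained, fine_grained)
        return self.norm(unified)                           # (B, P, dim)
```

Similarly, the "hybrid cross-modal loss" is not specified in the abstract. A common way to hybridize image-text matching objectives is a hinge-based bidirectional triplet loss combined with a symmetric InfoNCE term, sketched below purely as an illustration of the general idea.

```python
# Hedged sketch of a hybrid cross-modal loss: hardest-negative triplet loss
# plus symmetric InfoNCE. The actual composition in the paper is unspecified.
import torch
import torch.nn.functional as F

def hybrid_cross_modal_loss(img_emb, txt_emb, margin=0.2,
                            temperature=0.07, alpha=0.5):
    img = F.normalize(img_emb, dim=-1)   # (B, dim)
    txt = F.normalize(txt_emb, dim=-1)   # (B, dim)
    sim = img @ txt.t()                  # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)        # matched pairs lie on the diagonal

    # Bidirectional triplet loss over the hardest in-batch negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, -1.0)
    cost_i2t = (margin + neg.max(dim=1).values.unsqueeze(1) - pos).clamp(min=0)
    cost_t2i = (margin + neg.max(dim=0).values.unsqueeze(1) - pos).clamp(min=0)
    triplet = cost_i2t.mean() + cost_t2i.mean()

    # Symmetric InfoNCE (contrastive) term over the same similarity matrix.
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    nce = 0.5 * (F.cross_entropy(logits, labels) +
                 F.cross_entropy(logits.t(), labels))

    return alpha * triplet + (1 - alpha) * nce
```

Mining the hardest in-batch negatives, as above, is one standard way to cope with the high inter-sample similarity the abstract notes for RS datasets, since near-duplicate scenes otherwise produce weak gradients.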
Journal Introduction:
The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues sponsored by the IEEE Geoscience and Remote Sensing Society. The journal draws upon the experience of the highly successful IEEE Transactions on Geoscience and Remote Sensing and provides a complementary medium for the wide range of topics in applied Earth observations. The "Applications" areas encompass the societal benefit areas of the Global Earth Observation System of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these are areas not traditionally addressed in the IEEE context, including biodiversity, health, and climate. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards, and ocean engineering, that form the technical underpinnings of GEOSS. The journal thus attracts a broad range of interests that serves present members in new ways and expands the IEEE's visibility into new areas.