Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval

IF 7.5 | CAS Tier 1 (Earth Science) | JCR Q1 (Engineering, Electrical & Electronic)
Rui Yang;Shuang Wang;Yingping Han;Yuanheng Li;Dong Zhao;Dou Quan;Yanhe Guo;Licheng Jiao;Zhi Yang
{"title":"Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval","authors":"Rui Yang;Shuang Wang;Yingping Han;Yuanheng Li;Dong Zhao;Dou Quan;Yanhe Guo;Licheng Jiao;Zhi Yang","doi":"10.1109/TGRS.2024.3496898","DOIUrl":null,"url":null,"abstract":"Remote sensing image-text retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Considering the multiscale representations in image content and text vocabulary can enable the models to learn richer representations and enhance retrieval. Current multiscale RSITR approaches typically align multiscale fused image features with text features but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel multiscale alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: 1) a multiscale cross-modal alignment transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch; 2) a multiscale cross-modal semantic alignment loss (MSCMA loss) that enforces semantic alignment across scales; and 3) a cross-scale multimodal semantic consistency loss (CSMMC loss) that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is \n<uri>https://github.com/yr666666/MSA</uri>\n.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-17"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10758255/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Remote sensing image-text retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Taking into account the multiscale representations in image content and text vocabulary enables models to learn richer representations and enhances retrieval. Current multiscale RSITR approaches typically align multiscale fused image features with text features but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel multiscale alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: 1) a multiscale cross-modal alignment transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch; 2) a multiscale cross-modal semantic alignment loss (MSCMA loss) that enforces semantic alignment across scales; and 3) a cross-scale multimodal semantic consistency loss (CSMMC loss) that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is https://github.com/yr666666/MSA.
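The abstract only names the components, so the following minimal PyTorch sketch illustrates the cross-scale guidance idea in general terms: a mini-batch matching matrix computed at the largest scale supervises the matrices produced at smaller scales. The cosine-similarity matching matrix, the KL-divergence formulation, the temperature value, and the names `matching_matrix` and `cross_scale_consistency_loss` are illustrative assumptions, not the paper's actual MSCMAT or CSMMC loss definitions; see the linked repository for the authors' implementation.

```python
# Illustrative sketch only (not the authors' released code).
import torch
import torch.nn.functional as F


def matching_matrix(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matching scores for a mini-batch (B x B matrix)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    return img @ txt.t()


def cross_scale_consistency_loss(small_scale_scores: torch.Tensor,
                                 large_scale_scores: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """KL divergence pushing the small-scale score distribution toward the
    (detached) large-scale distribution, row-wise over the mini-batch."""
    teacher = F.softmax(large_scale_scores.detach() / temperature, dim=-1)
    student = F.log_softmax(small_scale_scores / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")


if __name__ == "__main__":
    B, D = 8, 256  # mini-batch size and embedding dimension (arbitrary)
    img_small, img_large = torch.randn(B, D), torch.randn(B, D)
    txt = torch.randn(B, D)
    s_small = matching_matrix(img_small, txt)
    s_large = matching_matrix(img_large, txt)
    print(cross_scale_consistency_loss(s_small, s_large))
```

Detaching the large-scale scores treats them as a fixed target, so the gradient only adjusts the smaller scale toward the largest-scale matching behavior.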
Source Journal

IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology - Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Articles published: 1912
Review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.