Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval

IF 7.5 | CAS Tier 1 (Earth Science) | JCR Q1 (Engineering, Electrical & Electronic)
Rui Yang;Shuang Wang;Yingping Han;Yuanheng Li;Dong Zhao;Dou Quan;Yanhe Guo;Licheng Jiao;Zhi Yang
{"title":"Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval","authors":"Rui Yang;Shuang Wang;Yingping Han;Yuanheng Li;Dong Zhao;Dou Quan;Yanhe Guo;Licheng Jiao;Zhi Yang","doi":"10.1109/TGRS.2024.3496898","DOIUrl":null,"url":null,"abstract":"Remote sensing image-text retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Considering the multiscale representations in image content and text vocabulary can enable the models to learn richer representations and enhance retrieval. Current multiscale RSITR approaches typically align multiscale fused image features with text features but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel multiscale alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: 1) a multiscale cross-modal alignment transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch; 2) a multiscale cross-modal semantic alignment loss (MSCMA loss) that enforces semantic alignment across scales; and 3) a cross-scale multimodal semantic consistency loss (CSMMC loss) that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is \n<uri>https://github.com/yr666666/MSA</uri>\n.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-17"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10758255/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Remote sensing image-text retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Taking into account the multiscale representations in image content and text vocabulary enables models to learn richer representations and enhances retrieval. Current multiscale RSITR approaches typically align multiscale fused image features with text features but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel multiscale alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: 1) a multiscale cross-modal alignment transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch; 2) a multiscale cross-modal semantic alignment loss (MSCMA loss) that enforces semantic alignment across scales; and 3) a cross-scale multimodal semantic consistency loss (CSMMC loss) that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is https://github.com/yr666666/MSA.
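The abstract only names the components, so the following minimal PyTorch sketch illustrates the cross-scale guidance idea in general terms: a mini-batch matching matrix computed at the largest scale supervises the matrices produced at smaller scales. The cosine-similarity matching matrix, the KL-divergence formulation, the temperature value, and the names `matching_matrix` and `cross_scale_consistency_loss` are illustrative assumptions, not the paper's actual MSCMAT or CSMMC loss definitions; see the linked repository for the authors' implementation.

```python
# Illustrative sketch only (not the authors' released code).
import torch
import torch.nn.functional as F


def matching_matrix(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matching scores for a mini-batch (B x B matrix)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    return img @ txt.t()


def cross_scale_consistency_loss(small_scale_scores: torch.Tensor,
                                 large_scale_scores: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """KL divergence pushing the small-scale score distribution toward the
    (detached) large-scale distribution, row-wise over the mini-batch."""
    teacher = F.softmax(large_scale_scores.detach() / temperature, dim=-1)
    student = F.log_softmax(small_scale_scores / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")


if __name__ == "__main__":
    B, D = 8, 256  # mini-batch size and embedding dimension (arbitrary)
    img_small, img_large = torch.randn(B, D), torch.randn(B, D)
    txt = torch.randn(B, D)
    s_small = matching_matrix(img_small, txt)
    s_large = matching_matrix(img_large, txt)
    print(cross_scale_consistency_loss(s_small, s_large))
```

Detaching the large-scale scores treats them as a fixed target, so the gradient only adjusts the smaller scale toward the largest-scale matching behavior.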
Source Journal

IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology - Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Articles published: 1912
Review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.