Transcending Fusion: A Multiscale Alignment Method for Remote Sensing Image–Text Retrieval
Rui Yang; Shuang Wang; Yingping Han; Yuanheng Li; Dong Zhao; Dou Quan; Yanhe Guo; Licheng Jiao; Zhi Yang
IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-17, 2024. DOI: 10.1109/TGRS.2024.3496898
Abstract
Remote sensing image–text retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for the multiscale representations in image content and text vocabulary enables models to learn richer representations and enhances retrieval. Current multiscale RSITR approaches typically align multiscale fused image features with text features but overlook aligning image–text pairs at each scale separately, which restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel multiscale alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: 1) a multiscale cross-modal alignment transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features and integrates global textual context to derive a matching score matrix within a mini-batch; 2) a multiscale cross-modal semantic alignment loss (MSCMA loss) that enforces semantic alignment across scales; and 3) a cross-scale multimodal semantic consistency loss (CSMMC loss) that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is https://github.com/yr666666/MSA.
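To make the alignment idea concrete, below is a minimal, hypothetical PyTorch-style sketch of the two mechanisms the abstract describes: building a mini-batch matching score matrix per scale so image–text pairs are aligned at every scale (not only after fusion), and letting the largest-scale matrix guide the smaller-scale ones. The function names, the cosine-similarity/temperature scoring, and the KL-divergence form of the cross-scale consistency term are illustrative assumptions, not the authors' MSCMAT/MSCMA/CSMMC implementation; refer to the GitHub repository above for the actual code.

```python
import torch
import torch.nn.functional as F

def matching_matrix(img_feats, txt_feats, temperature=0.07):
    """Mini-batch matching score matrix at one scale (assumed cosine similarity).

    img_feats: (B, D) image features at a single scale.
    txt_feats: (B, D) text features.
    Returns a (B, B) image-to-text score matrix.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    return img @ txt.t() / temperature

def cross_scale_consistency(small_scale_scores, largest_scale_scores):
    """Cross-scale consistency sketch: the largest-scale matching matrix
    (detached, acting as a teacher) guides a smaller-scale matrix via
    KL divergence between their row-wise distributions (an assumption)."""
    teacher = F.softmax(largest_scale_scores.detach(), dim=-1)
    student = F.log_softmax(small_scale_scores, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

def multiscale_alignment_loss(per_scale_img_feats, txt_feats):
    """Per-scale alignment plus cross-scale consistency (hypothetical names).

    per_scale_img_feats: list of (B, D) tensors, ordered small -> largest scale.
    txt_feats: (B, D) text features; matched pairs share the same batch index.
    """
    B = txt_feats.size(0)
    labels = torch.arange(B, device=txt_feats.device)  # diagonal = matched pairs
    score_mats = [matching_matrix(f, txt_feats) for f in per_scale_img_feats]
    # Align image-text pairs separately at every scale.
    align = sum(F.cross_entropy(s, labels) for s in score_mats)
    # Let the largest-scale matching matrix guide all smaller scales.
    consist = sum(cross_scale_consistency(s, score_mats[-1]) for s in score_mats[:-1])
    return align + consist

# Usage sketch with random features at three scales:
# feats = [torch.randn(32, 512) for _ in range(3)]
# loss = multiscale_alignment_loss(feats, torch.randn(32, 512))
```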
Journal Description:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.