{"title":"CSA-RSIC: Cross-Modal Semantic Alignment for Remote Sensing Image Captioning","authors":"Kangda Cheng;Jinlong Liu;Rui Mao;Zhilu Wu;Erik Cambria","doi":"10.1109/LGRS.2025.3601114","DOIUrl":null,"url":null,"abstract":"Remote sensing image captioning (RSIC) is an important task in environmental monitoring and disaster assessment. However, existing methods are constrained by redundant feature interference, insufficient multiscale feature integration, and cross-modal semantic gaps, leading to limited performance in scenarios requiring fine-grained descriptions and semantic integrity, such as disaster assessment and emergency response. In this letter, we propose a cross-modal semantic alignment model for RSIC (CSA-RSIC), addressing these challenges with three innovations. First, we designed an adaptive feature selection module (AFSM) that generates channel weights through dual pooling. The AFSM dynamically weights the most informative features at each scale to improve caption accuracy. Second, we propose a cross-scale feature aggregation module (CFAM) that constructs a hierarchical feature pyramid by aligning multiscale resolutions and performs attention-guided fusion with enhanced weighting via AFSM, ensuring the effective integration of fine-grained and global semantic information. Finally, a novel loss function that combines contrastive learning and consistency loss is proposed to enhance the semantic alignment between visual and textual features. Experiments on three datasets show the advancement of CSA-RSIC over strong baselines, indicating its effectiveness in enhancing both semantic completeness and accuracy.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4000,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11133432/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Remote sensing image captioning (RSIC) is an important task in environmental monitoring and disaster assessment. However, existing methods are constrained by redundant feature interference, insufficient multiscale feature integration, and cross-modal semantic gaps, leading to limited performance in scenarios requiring fine-grained descriptions and semantic integrity, such as disaster assessment and emergency response. In this letter, we propose a cross-modal semantic alignment model for RSIC (CSA-RSIC) that addresses these challenges with three innovations. First, we design an adaptive feature selection module (AFSM) that generates channel weights through dual pooling; the AFSM dynamically weights the most informative features at each scale to improve caption accuracy. Second, we propose a cross-scale feature aggregation module (CFAM) that constructs a hierarchical feature pyramid by aligning multiscale resolutions and performs attention-guided fusion with enhanced weighting via AFSM, ensuring the effective integration of fine-grained and global semantic information. Finally, a novel loss function that combines contrastive learning and consistency loss is proposed to enhance the semantic alignment between visual and textual features. Experiments on three datasets show that CSA-RSIC outperforms strong baselines, indicating its effectiveness in enhancing both semantic completeness and accuracy.
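The abstract describes the AFSM only at a high level (channel weights generated through dual pooling). For concreteness, the following is a minimal PyTorch-style sketch of one plausible reading of that description; the shared MLP, reduction ratio, and sigmoid gating are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AFSM(nn.Module):
    """Hypothetical sketch of an adaptive feature selection module:
    per-channel weights from dual (average + max) global pooling,
    applied back to the input feature map. Layer sizes are assumptions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from one scale of the visual encoder
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # fused per-channel weights
        return x * w                                   # re-weighted features


if __name__ == "__main__":
    feats = torch.randn(2, 256, 32, 32)
    print(AFSM(256)(feats).shape)  # torch.Size([2, 256, 32, 32])
```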
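Similarly, the CFAM is summarized as resolution alignment across scales plus attention-guided fusion with AFSM-enhanced weighting. A rough sketch under stated assumptions (a common projection width, bilinear resizing to the finest scale, and learned softmax weights over scales) is given below; it reuses the hypothetical AFSM class from the previous sketch and does not claim to match the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CFAM(nn.Module):
    """Hypothetical sketch of cross-scale feature aggregation: project each
    scale to a shared width, resize to a common resolution, re-weight each
    scale with AFSM, and fuse with softmax attention over scales."""

    def __init__(self, in_channels: list[int], out_channels: int = 256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.afsm = nn.ModuleList([AFSM(out_channels) for _ in in_channels])
        self.scale_logits = nn.Parameter(torch.zeros(len(in_channels)))

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: multiscale encoder outputs, e.g., stages at strides 8/16/32
        target = feats[0].shape[-2:]  # fuse at the finest resolution
        aligned = []
        for f, proj, afsm in zip(feats, self.proj, self.afsm):
            f = proj(f)
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            aligned.append(afsm(f))   # AFSM-enhanced weighting per scale
        weights = torch.softmax(self.scale_logits, dim=0)  # attention over scales
        return sum(w * f for w, f in zip(weights, aligned))
```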
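Finally, the proposed objective combines contrastive learning with a consistency loss. A hedged sketch of one such combination, using a symmetric InfoNCE-style term between paired image and caption embeddings plus an MSE consistency term, is shown below; the temperature, the weighting factor `lambda_cons`, and the exact form of the consistency term are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def alignment_loss(img_emb: torch.Tensor,
                   txt_emb: torch.Tensor,
                   temperature: float = 0.07,
                   lambda_cons: float = 1.0) -> torch.Tensor:
    """Hypothetical combined objective: a symmetric contrastive term that pulls
    matched image/caption embeddings together across the batch, plus a
    consistency term penalizing the gap between the two modality embeddings
    of the same sample."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    consistency = F.mse_loss(img, txt)                    # paired-embedding consistency
    return contrastive + lambda_cons * consistency
```

In this reading, the contrastive term enforces discrimination across mismatched image-caption pairs, while the consistency term directly tightens the alignment of each matched pair, which is one common way to combine the two signals the abstract names.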