{"title":"CSA-RSIC: Cross-Modal Semantic Alignment for Remote Sensing Image Captioning","authors":"Kangda Cheng;Jinlong Liu;Rui Mao;Zhilu Wu;Erik Cambria","doi":"10.1109/LGRS.2025.3601114","DOIUrl":null,"url":null,"abstract":"Remote sensing image captioning (RSIC) is an important task in environmental monitoring and disaster assessment. However, existing methods are constrained by redundant feature interference, insufficient multiscale feature integration, and cross-modal semantic gaps, leading to limited performance in scenarios requiring fine-grained descriptions and semantic integrity, such as disaster assessment and emergency response. In this letter, we propose a cross-modal semantic alignment model for RSIC (CSA-RSIC), addressing these challenges with three innovations. First, we designed an adaptive feature selection module (AFSM) that generates channel weights through dual pooling. The AFSM dynamically weights the most informative features at each scale to improve caption accuracy. Second, we propose a cross-scale feature aggregation module (CFAM) that constructs a hierarchical feature pyramid by aligning multiscale resolutions and performs attention-guided fusion with enhanced weighting via AFSM, ensuring the effective integration of fine-grained and global semantic information. Finally, a novel loss function that combines contrastive learning and consistency loss is proposed to enhance the semantic alignment between visual and textual features. Experiments on three datasets show the advancement of CSA-RSIC over strong baselines, indicating its effectiveness in enhancing both semantic completeness and accuracy.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4000,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11133432/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Remote sensing image captioning (RSIC) is an important task in environmental monitoring and disaster assessment. However, existing methods are constrained by redundant feature interference, insufficient multiscale feature integration, and cross-modal semantic gaps, leading to limited performance in scenarios requiring fine-grained descriptions and semantic integrity, such as disaster assessment and emergency response. In this letter, we propose a cross-modal semantic alignment model for RSIC (CSA-RSIC) that addresses these challenges with three innovations. First, we design an adaptive feature selection module (AFSM) that generates channel weights through dual pooling; the AFSM dynamically weights the most informative features at each scale to improve caption accuracy. Second, we propose a cross-scale feature aggregation module (CFAM) that constructs a hierarchical feature pyramid by aligning multiscale resolutions and performs attention-guided fusion with enhanced weighting via AFSM, ensuring the effective integration of fine-grained and global semantic information. Finally, a novel loss function that combines contrastive learning and consistency loss is proposed to enhance the semantic alignment between visual and textual features. Experiments on three datasets show that CSA-RSIC outperforms strong baselines, indicating its effectiveness in enhancing both semantic completeness and accuracy.
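The abstract describes the AFSM only at a high level (channel weights generated through dual pooling). For concreteness, the following is a minimal PyTorch-style sketch of one plausible reading of that description; the shared MLP, reduction ratio, and sigmoid gating are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AFSM(nn.Module):
    """Hypothetical sketch of an adaptive feature selection module:
    per-channel weights from dual (average + max) global pooling,
    applied back to the input feature map. Layer sizes are assumptions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from one scale of the visual encoder
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # fused per-channel weights
        return x * w                                   # re-weighted features


if __name__ == "__main__":
    feats = torch.randn(2, 256, 32, 32)
    print(AFSM(256)(feats).shape)  # torch.Size([2, 256, 32, 32])
```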
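Similarly, the CFAM is summarized as resolution alignment across scales plus attention-guided fusion with AFSM-enhanced weighting. A rough sketch under stated assumptions (a common projection width, bilinear resizing to the finest scale, and learned softmax weights over scales) is given below; it reuses the hypothetical AFSM class from the previous sketch and does not claim to match the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CFAM(nn.Module):
    """Hypothetical sketch of cross-scale feature aggregation: project each
    scale to a shared width, resize to a common resolution, re-weight each
    scale with AFSM, and fuse with softmax attention over scales."""

    def __init__(self, in_channels: list[int], out_channels: int = 256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.afsm = nn.ModuleList([AFSM(out_channels) for _ in in_channels])
        self.scale_logits = nn.Parameter(torch.zeros(len(in_channels)))

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: multiscale encoder outputs, e.g., stages at strides 8/16/32
        target = feats[0].shape[-2:]  # fuse at the finest resolution
        aligned = []
        for f, proj, afsm in zip(feats, self.proj, self.afsm):
            f = proj(f)
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            aligned.append(afsm(f))   # AFSM-enhanced weighting per scale
        weights = torch.softmax(self.scale_logits, dim=0)  # attention over scales
        return sum(w * f for w, f in zip(weights, aligned))
```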
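Finally, the proposed objective combines contrastive learning with a consistency loss. A hedged sketch of one such combination, using a symmetric InfoNCE-style term between paired image and caption embeddings plus an MSE consistency term, is shown below; the temperature, the weighting factor `lambda_cons`, and the exact form of the consistency term are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def alignment_loss(img_emb: torch.Tensor,
                   txt_emb: torch.Tensor,
                   temperature: float = 0.07,
                   lambda_cons: float = 1.0) -> torch.Tensor:
    """Hypothetical combined objective: a symmetric contrastive term that pulls
    matched image/caption embeddings together across the batch, plus a
    consistency term penalizing the gap between the two modality embeddings
    of the same sample."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    consistency = F.mse_loss(img, txt)                    # paired-embedding consistency
    return contrastive + lambda_cons * consistency
```

In this reading, the contrastive term enforces discrimination across mismatched image-caption pairs, while the consistency term directly tightens the alignment of each matched pair, which is one common way to combine the two signals the abstract names.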