{"title":"Establishing Nuanced Multimodal Attention for Weakly Supervised Semantic Segmentation of Remote Sensing Scenes","authors":"Qiming Zhang;Junjie Zhang;Huaxi Huang;Fangyu Wu;Hongwen Yu","doi":"10.1109/LGRS.2025.3565710","DOIUrl":null,"url":null,"abstract":"Weakly supervised semantic segmentation (WSSS) with image-level labels reduces reliance on pixel-level annotations for remote sensing (RS) imagery. However, in natural scenes, WSSS frequently faces challenges such as imprecise localization, extraneous activations, and class ambiguity. These challenges are particularly pronounced in RS images, characterized by complex backgrounds, substantial scale variations, and dense small-object distributions, complicating the distinction between intraclass variations and interclass similarities. To tackle these challenges, we introduce a class-constrained multimodal attention framework aimed at enhancing the localization accuracy of class activation maps (CAMs). Specifically, we design class-specific tokens to capture the visual characteristics of each target class. As these tokens initially lack explicit constraints, we integrate the textual branch of the RemoteCLIP model to leverage class-related linguistic priors, which collaborate with visual features to encode the specific semantics of diverse objects. Furthermore, the multimodal collaborative optimization module dynamically establishes tailored attention mechanisms for both global and regional features, thereby improving class discriminability among targets to mitigate challenges such as interclass similarity and dense small-object distributions. By refining class-specific attention, textual semantic attention, and patch-level pairwise affinity weights, the quality of generated pseudomasks is markedly enhanced. Concurrently, to ensure domain-invariant feature learning, we align the backbone features with the CLIP visual embedding by minimizing the distribution disparity between the two in the latent space, and semantic consistency is, therefore, preserved. The experimental results validate the effectiveness and robustness of our proposed method, achieving significant performance improvements on two representative RS WSSS datasets.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10980290/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Weakly supervised semantic segmentation (WSSS) with image-level labels reduces reliance on pixel-level annotations for remote sensing (RS) imagery. However, even in natural scenes, WSSS frequently faces challenges such as imprecise localization, extraneous activations, and class ambiguity. These challenges are particularly pronounced in RS images, which are characterized by complex backgrounds, substantial scale variations, and dense small-object distributions, complicating the distinction between intraclass variation and interclass similarity. To tackle these challenges, we introduce a class-constrained multimodal attention framework aimed at enhancing the localization accuracy of class activation maps (CAMs). Specifically, we design class-specific tokens to capture the visual characteristics of each target class. As these tokens initially lack explicit constraints, we integrate the textual branch of the RemoteCLIP model to leverage class-related linguistic priors, which collaborate with visual features to encode the specific semantics of diverse objects. Furthermore, the multimodal collaborative optimization module dynamically establishes tailored attention mechanisms for both global and regional features, thereby improving class discriminability among targets to mitigate challenges such as interclass similarity and dense small-object distributions. By refining class-specific attention, textual semantic attention, and patch-level pairwise affinity weights, the quality of the generated pseudomasks is markedly enhanced. Concurrently, to ensure domain-invariant feature learning, we align the backbone features with the CLIP visual embedding by minimizing the distribution disparity between the two in the latent space, thereby preserving semantic consistency. The experimental results validate the effectiveness and robustness of our proposed method, achieving significant performance improvements on two representative RS WSSS datasets.
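To make the class-token mechanism concrete, the following is a minimal sketch of how learnable class-specific tokens could be constrained by textual priors and attend over patch features to yield CAM-like localization maps. The abstract does not give the architecture's internals, so the linear text projection, the use of `nn.MultiheadAttention`, the head count, and the names `ClassTextAttention`, `patch_feats`, and `text_embeds` are all illustrative assumptions; `text_embeds` stands in for per-class embeddings that would come from RemoteCLIP's textual branch.

```python
import torch
import torch.nn as nn

class ClassTextAttention(nn.Module):
    """Sketch: class-specific tokens constrained by textual priors.

    Hypothetical module; the paper's actual design is not specified in
    the abstract. One learnable token per target class is fused with a
    class-name text embedding, then attends over backbone patch features.
    """

    def __init__(self, num_classes: int, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Learnable class-specific tokens, one per target class.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        # Assumed linear projection of textual priors into the visual token space.
        self.text_proj = nn.Linear(text_dim, dim)
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor):
        # patch_feats: (B, N, dim) patch features from the backbone.
        # text_embeds: (C, text_dim) class embeddings from the text encoder.
        B = patch_feats.size(0)
        # Fuse linguistic priors into the class tokens (the "class constraint").
        queries = self.class_tokens + self.text_proj(text_embeds)   # (C, dim)
        queries = queries.unsqueeze(0).expand(B, -1, -1)            # (B, C, dim)
        # Class tokens attend over patches; the attention weights serve as
        # coarse class-specific localization maps, analogous to CAMs.
        out, attn_weights = self.attn(queries, patch_feats, patch_feats)
        return out, attn_weights  # attn_weights: (B, C, N)
```

For example, with a ViT-style backbone one might instantiate `ClassTextAttention(num_classes=6, dim=768, text_dim=512)` and threshold `attn_weights` per class to seed pseudomasks.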
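The alignment step ("minimizing the distribution disparity between the two in the latent space") can likewise be sketched. The exact divergence used by the authors is not stated, so this version assumes a learned linear projection into the CLIP space and a cosine-distance objective against frozen CLIP visual embeddings; `FeatureAlignmentLoss` and both argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentLoss(nn.Module):
    """Sketch: align backbone features with frozen CLIP visual embeddings.

    Assumption: disparity is measured as mean cosine distance after a
    learned projection; the paper may use a different divergence.
    """

    def __init__(self, backbone_dim: int, clip_dim: int):
        super().__init__()
        # Hypothetical projection from the backbone space into the CLIP space.
        self.proj = nn.Linear(backbone_dim, clip_dim)

    def forward(self, backbone_feats: torch.Tensor,
                clip_feats: torch.Tensor) -> torch.Tensor:
        # backbone_feats: (B, N, backbone_dim) patch tokens from the backbone.
        # clip_feats:     (B, N, clip_dim) patch embeddings from CLIP's visual
        #                 encoder; detached so the CLIP branch stays frozen.
        z = F.normalize(self.proj(backbone_feats), dim=-1)
        t = F.normalize(clip_feats.detach(), dim=-1)
        # 1 - cosine similarity, averaged over patches and the batch.
        return (1.0 - (z * t).sum(dim=-1)).mean()
```

This term would be added to the overall training loss so the backbone inherits CLIP's domain-invariant semantics while the segmentation branches are trained with image-level labels.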