{"title":"Establishing Nuanced Multimodal Attention for Weakly Supervised Semantic Segmentation of Remote Sensing Scenes","authors":"Qiming Zhang;Junjie Zhang;Huaxi Huang;Fangyu Wu;Hongwen Yu","doi":"10.1109/LGRS.2025.3565710","DOIUrl":null,"url":null,"abstract":"Weakly supervised semantic segmentation (WSSS) with image-level labels reduces reliance on pixel-level annotations for remote sensing (RS) imagery. However, in natural scenes, WSSS frequently faces challenges such as imprecise localization, extraneous activations, and class ambiguity. These challenges are particularly pronounced in RS images, characterized by complex backgrounds, substantial scale variations, and dense small-object distributions, complicating the distinction between intraclass variations and interclass similarities. To tackle these challenges, we introduce a class-constrained multimodal attention framework aimed at enhancing the localization accuracy of class activation maps (CAMs). Specifically, we design class-specific tokens to capture the visual characteristics of each target class. As these tokens initially lack explicit constraints, we integrate the textual branch of the RemoteCLIP model to leverage class-related linguistic priors, which collaborate with visual features to encode the specific semantics of diverse objects. Furthermore, the multimodal collaborative optimization module dynamically establishes tailored attention mechanisms for both global and regional features, thereby improving class discriminability among targets to mitigate challenges such as interclass similarity and dense small-object distributions. By refining class-specific attention, textual semantic attention, and patch-level pairwise affinity weights, the quality of generated pseudomasks is markedly enhanced. Concurrently, to ensure domain-invariant feature learning, we align the backbone features with the CLIP visual embedding by minimizing the distribution disparity between the two in the latent space, and semantic consistency is, therefore, preserved. The experimental results validate the effectiveness and robustness of our proposed method, achieving significant performance improvements on two representative RS WSSS datasets.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10980290/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Weakly supervised semantic segmentation (WSSS) with image-level labels reduces reliance on pixel-level annotations for remote sensing (RS) imagery. However, even in natural scenes, WSSS frequently faces challenges such as imprecise localization, extraneous activations, and class ambiguity. These challenges are particularly pronounced in RS images, which are characterized by complex backgrounds, substantial scale variations, and dense small-object distributions, complicating the distinction between intraclass variation and interclass similarity. To tackle these challenges, we introduce a class-constrained multimodal attention framework aimed at enhancing the localization accuracy of class activation maps (CAMs). Specifically, we design class-specific tokens to capture the visual characteristics of each target class. As these tokens initially lack explicit constraints, we integrate the textual branch of the RemoteCLIP model to leverage class-related linguistic priors, which collaborate with visual features to encode the specific semantics of diverse objects. Furthermore, the multimodal collaborative optimization module dynamically establishes tailored attention mechanisms for both global and regional features, thereby improving class discriminability among targets to mitigate challenges such as interclass similarity and dense small-object distributions. By refining class-specific attention, textual semantic attention, and patch-level pairwise affinity weights, the quality of the generated pseudomasks is markedly enhanced. Concurrently, to ensure domain-invariant feature learning, we align the backbone features with the CLIP visual embedding by minimizing the distribution disparity between the two in the latent space, thereby preserving semantic consistency. The experimental results validate the effectiveness and robustness of our proposed method, achieving significant performance improvements on two representative RS WSSS datasets.
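To make the class-token mechanism concrete, the following is a minimal sketch of how learnable class-specific tokens could be constrained by textual priors and attend over patch features to yield CAM-like localization maps. The abstract does not give the architecture's internals, so the linear text projection, the use of `nn.MultiheadAttention`, the head count, and the names `ClassTextAttention`, `patch_feats`, and `text_embeds` are all illustrative assumptions; `text_embeds` stands in for per-class embeddings that would come from RemoteCLIP's textual branch.

```python
import torch
import torch.nn as nn

class ClassTextAttention(nn.Module):
    """Sketch: class-specific tokens constrained by textual priors.

    Hypothetical module; the paper's actual design is not specified in
    the abstract. One learnable token per target class is fused with a
    class-name text embedding, then attends over backbone patch features.
    """

    def __init__(self, num_classes: int, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Learnable class-specific tokens, one per target class.
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        # Assumed linear projection of textual priors into the visual token space.
        self.text_proj = nn.Linear(text_dim, dim)
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor, text_embeds: torch.Tensor):
        # patch_feats: (B, N, dim) patch features from the backbone.
        # text_embeds: (C, text_dim) class embeddings from the text encoder.
        B = patch_feats.size(0)
        # Fuse linguistic priors into the class tokens (the "class constraint").
        queries = self.class_tokens + self.text_proj(text_embeds)   # (C, dim)
        queries = queries.unsqueeze(0).expand(B, -1, -1)            # (B, C, dim)
        # Class tokens attend over patches; the attention weights serve as
        # coarse class-specific localization maps, analogous to CAMs.
        out, attn_weights = self.attn(queries, patch_feats, patch_feats)
        return out, attn_weights  # attn_weights: (B, C, N)
```

For example, with a ViT-style backbone one might instantiate `ClassTextAttention(num_classes=6, dim=768, text_dim=512)` and threshold `attn_weights` per class to seed pseudomasks.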
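The alignment step ("minimizing the distribution disparity between the two in the latent space") can likewise be sketched. The exact divergence used by the authors is not stated, so this version assumes a learned linear projection into the CLIP space and a cosine-distance objective against frozen CLIP visual embeddings; `FeatureAlignmentLoss` and both argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentLoss(nn.Module):
    """Sketch: align backbone features with frozen CLIP visual embeddings.

    Assumption: disparity is measured as mean cosine distance after a
    learned projection; the paper may use a different divergence.
    """

    def __init__(self, backbone_dim: int, clip_dim: int):
        super().__init__()
        # Hypothetical projection from the backbone space into the CLIP space.
        self.proj = nn.Linear(backbone_dim, clip_dim)

    def forward(self, backbone_feats: torch.Tensor,
                clip_feats: torch.Tensor) -> torch.Tensor:
        # backbone_feats: (B, N, backbone_dim) patch tokens from the backbone.
        # clip_feats:     (B, N, clip_dim) patch embeddings from CLIP's visual
        #                 encoder; detached so the CLIP branch stays frozen.
        z = F.normalize(self.proj(backbone_feats), dim=-1)
        t = F.normalize(clip_feats.detach(), dim=-1)
        # 1 - cosine similarity, averaged over patches and the batch.
        return (1.0 - (z * t).sum(dim=-1)).mean()
```

This term would be added to the overall training loss so the backbone inherits CLIP's domain-invariant semantics while the segmentation branches are trained with image-level labels.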