{"title":"MBSSNet: A Mamba-Based Joint Semantic Segmentation Network for Optical and SAR Images","authors":"Jie Li;Zhanhong Liu;Shujun Liu;Huajun Wang","doi":"10.1109/LGRS.2025.3541895","DOIUrl":null,"url":null,"abstract":"The utilization of both optical and synthetic aperture radar (SAR) images for joint semantic segmentation enhances the accuracy of land use classification. Recent advancements in multimodal fusion models, particularly those using self-attention mechanisms and convolutional neural networks (CNNs), have yielded significant results. However, self-attention has quadratic computational complexity, and CNN has insufficient local-global contextual modeling power. Recently, 2-D-selective-scan (SS2D) has emerged as a promising approach. It excels in modeling long-range dependencies while maintaining linear computational complexity. Based on SS2D, we propose a joint semantic segmentation network for optical and SAR images, called MBSSNet. Specifically, we introduce SS2D and design a cross-modal fusion module (CMFM) to fuse multimodal features from dual branches layer by layer, thereby enhancing the consistency of fused feature representations. In addition, during the decoding phase, we integrate contextual information from multiscale fusion features, thereby enhancing the spatial and semantic information of the fused features. Our experimental results show that our method outperforms the state-of-the-art (SOTA), and overall accuracy (OA), mean intersection over union (mIoU), and Kappa outperform other SOTA methods by 1.7%, 3.1%, and 2.2%, respectively.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10884783/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The utilization of both optical and synthetic aperture radar (SAR) images for joint semantic segmentation enhances the accuracy of land use classification. Recent advances in multimodal fusion models, particularly those using self-attention mechanisms and convolutional neural networks (CNNs), have yielded significant results. However, self-attention incurs quadratic computational complexity, and CNNs have limited capacity for joint local-global contextual modeling. Recently, the 2-D selective scan (SS2D) has emerged as a promising alternative: it excels at modeling long-range dependencies while maintaining linear computational complexity. Building on SS2D, we propose a joint semantic segmentation network for optical and SAR images, called MBSSNet. Specifically, we introduce SS2D and design a cross-modal fusion module (CMFM) to fuse multimodal features from the dual branches layer by layer, thereby enhancing the consistency of the fused feature representations. In addition, during the decoding phase, we integrate contextual information from multiscale fusion features, enriching the spatial and semantic information of the fused features. Experimental results show that our method outperforms state-of-the-art (SOTA) approaches, improving overall accuracy (OA), mean intersection over union (mIoU), and Kappa by 1.7%, 3.1%, and 2.2%, respectively.
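To make the layer-by-layer cross-modal fusion idea concrete, below is a minimal PyTorch sketch of one fusion step in the spirit of the CMFM described above. It is an illustrative assumption, not the authors' implementation: the class names (CrossModalFusion, SS2DPlaceholder), the gating scheme, and the parameter `dim` are hypothetical, and the SS2D block (whose details are not given in the abstract) is stood in for by a simple depthwise-convolution mixer.

```python
# Hypothetical sketch: fuse same-scale optical and SAR feature maps at one encoder stage.
import torch
import torch.nn as nn


class SS2DPlaceholder(nn.Module):
    """Stand-in for a 2-D selective-scan (SS2D) block; the real block models
    long-range dependencies with linear computational complexity."""

    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # spatial mixing
            nn.Conv2d(dim, dim, kernel_size=1),                         # channel mixing
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mix(x)  # residual connection keeps the input signal intact


class CrossModalFusion(nn.Module):
    """Illustrative cross-modal fusion of optical and SAR features at one scale."""

    def __init__(self, dim: int):
        super().__init__()
        self.scan = SS2DPlaceholder(dim)
        # Gate decides, per pixel and channel, how much each modality contributes.
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, kernel_size=1), nn.Sigmoid())
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, f_opt: torch.Tensor, f_sar: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([f_opt, f_sar], dim=1))   # (B, C, H, W), values in [0, 1]
        fused = g * f_opt + (1.0 - g) * f_sar             # modality-weighted blend
        return self.proj(self.scan(fused))                # add global context, then project


if __name__ == "__main__":
    opt = torch.randn(2, 64, 32, 32)   # optical features at one encoder stage
    sar = torch.randn(2, 64, 32, 32)   # SAR features at the same stage
    print(CrossModalFusion(64)(opt, sar).shape)  # torch.Size([2, 64, 32, 32])
```

In a full network following the abstract's description, one such module would be applied at every encoder stage, and the resulting multiscale fused features would then be aggregated in the decoder.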