{"title":"GLMambaNet:用于遥感图像语义分割的基于mamba的局部细节增强解码器","authors":"Zhengyu Zhu , Xinaoxue Zhang , Xiaobo Zhang , Zixuan Zhao , Feng Chen","doi":"10.1016/j.imavis.2025.105774","DOIUrl":null,"url":null,"abstract":"<div><div>Accurate semantic segmentation of high-resolution remote sensing imagery is critical for land cover classification, supporting applications ranging from urban infrastructure planning to ecological conservation. In the context of remote sensing, this task is particularly challenging due to the high spatial resolution, spectral complexity, and the presence of small or irregularly shaped objects. Existing methods often struggle to balance global context modeling and local detail preservation—both essential for precise segmentation in complex scenes. This motivates the design of new architectures capable of capturing long-range dependencies while remaining sensitive to fine-grained spatial details, without incurring excessive computational cost. While Transformer architectures effectively model long-range dependencies, their quadratic complexity limits scalability for high-resolution imagery. To address these challenges, we present GLMambaNet, a dual-stream architecture that combines Swin Transformer’s hierarchical encoding with a novel Mamba-based decoder. The framework introduces two core components: the Mamba Global Context Module (MGCM), which leverages state space modeling with channel attention to enhance global–local context integration, and the Local Detail Enhancement Module (LDEM), which improves boundary and texture preservation through gradient-aware convolutions. On the Vaihingen dataset, our model achieves a mean F1-score of 91.82% and mIoU of 85.29%, surpassing CNN- and Transformer-based baselines in capturing fine details such as vehicle edges and shadows. On the Potsdam dataset, it achieves an mIoU of 87.58%, delivering enhanced performance across key classes including buildings, trees, and cars. These results demonstrate that GLMambaNet effectively balances segmentation accuracy and model complexity, providing a strong foundation for practical remote sensing applications.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105774"},"PeriodicalIF":4.2000,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GLMambaNet: Mamba-based decoder with local detail enhancement for semantic segmentation of remote sensing imagery\",\"authors\":\"Zhengyu Zhu , Xinaoxue Zhang , Xiaobo Zhang , Zixuan Zhao , Feng Chen\",\"doi\":\"10.1016/j.imavis.2025.105774\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Accurate semantic segmentation of high-resolution remote sensing imagery is critical for land cover classification, supporting applications ranging from urban infrastructure planning to ecological conservation. In the context of remote sensing, this task is particularly challenging due to the high spatial resolution, spectral complexity, and the presence of small or irregularly shaped objects. Existing methods often struggle to balance global context modeling and local detail preservation—both essential for precise segmentation in complex scenes. This motivates the design of new architectures capable of capturing long-range dependencies while remaining sensitive to fine-grained spatial details, without incurring excessive computational cost. 
While Transformer architectures effectively model long-range dependencies, their quadratic complexity limits scalability for high-resolution imagery. To address these challenges, we present GLMambaNet, a dual-stream architecture that combines Swin Transformer’s hierarchical encoding with a novel Mamba-based decoder. The framework introduces two core components: the Mamba Global Context Module (MGCM), which leverages state space modeling with channel attention to enhance global–local context integration, and the Local Detail Enhancement Module (LDEM), which improves boundary and texture preservation through gradient-aware convolutions. On the Vaihingen dataset, our model achieves a mean F1-score of 91.82% and mIoU of 85.29%, surpassing CNN- and Transformer-based baselines in capturing fine details such as vehicle edges and shadows. On the Potsdam dataset, it achieves an mIoU of 87.58%, delivering enhanced performance across key classes including buildings, trees, and cars. These results demonstrate that GLMambaNet effectively balances segmentation accuracy and model complexity, providing a strong foundation for practical remote sensing applications.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"163 \",\"pages\":\"Article 105774\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625003622\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003622","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
GLMambaNet: Mamba-based decoder with local detail enhancement for semantic segmentation of remote sensing imagery
Accurate semantic segmentation of high-resolution remote sensing imagery is critical for land cover classification, supporting applications ranging from urban infrastructure planning to ecological conservation. In the context of remote sensing, this task is particularly challenging due to the high spatial resolution, spectral complexity, and the presence of small or irregularly shaped objects. Existing methods often struggle to balance global context modeling and local detail preservation—both essential for precise segmentation in complex scenes. This motivates the design of new architectures capable of capturing long-range dependencies while remaining sensitive to fine-grained spatial details, without incurring excessive computational cost. While Transformer architectures effectively model long-range dependencies, their quadratic complexity limits scalability for high-resolution imagery. To address these challenges, we present GLMambaNet, a dual-stream architecture that combines Swin Transformer’s hierarchical encoding with a novel Mamba-based decoder. The framework introduces two core components: the Mamba Global Context Module (MGCM), which leverages state space modeling with channel attention to enhance global–local context integration, and the Local Detail Enhancement Module (LDEM), which improves boundary and texture preservation through gradient-aware convolutions. On the Vaihingen dataset, our model achieves a mean F1-score of 91.82% and mIoU of 85.29%, surpassing CNN- and Transformer-based baselines in capturing fine details such as vehicle edges and shadows. On the Potsdam dataset, it achieves an mIoU of 87.58%, delivering enhanced performance across key classes including buildings, trees, and cars. These results demonstrate that GLMambaNet effectively balances segmentation accuracy and model complexity, providing a strong foundation for practical remote sensing applications.
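The abstract describes the MGCM as combining state space modeling with channel attention to integrate global and local context. As a rough illustration of how such a module could be composed, the sketch below flattens a feature map into a token sequence, runs it through an off-the-shelf Mamba block from the `mamba_ssm` package, and gates the result with SE-style channel attention. The paper's actual MGCM wiring, scan order, and hyperparameters are not given in the abstract, so the LayerNorm placement, the squeeze-excite gate, and the residual fusion here are all assumptions.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)

class MambaGlobalContextSketch(nn.Module):
    """Hedged sketch of a Mamba-based global context module (not the
    paper's exact MGCM). Spatial features become a 1-D token sequence,
    a selective state space block mixes them with linear complexity in
    sequence length, and an SE-style gate re-weights channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.mamba = Mamba(d_model=channels)  # selective SSM block
        # SE-style channel attention over the globally mixed features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) tokens
        seq = self.mamba(self.norm(seq))          # long-range context scan
        y = seq.transpose(1, 2).reshape(b, c, h, w)
        return y * self.attn(y) + x               # gated residual fusion
```

The linear-time scan is what motivates replacing quadratic self-attention in the decoder: for a 512x512 feature map the sequence has 262,144 tokens, which attention handles in O(L^2) but an SSM handles in O(L).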
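Similarly, the LDEM is said to preserve boundaries and texture through gradient-aware convolutions. One plausible reading, sketched below under stated assumptions, is a block where fixed Sobel kernels extract per-channel horizontal and vertical gradients that are then fused back into the feature stream; the actual LDEM layer layout, kernel choice, and fusion scheme are not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientAwareConvSketch(nn.Module):
    """Hedged sketch of a gradient-aware convolution block (not the
    paper's exact LDEM). Fixed, non-learned Sobel filters expose edge
    and texture cues that a learned convolution then fuses with the
    original features, keeping the block sensitive to fine boundaries."""

    def __init__(self, channels: int):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        # Two fixed 3x3 kernels per channel (depthwise): (2C, 1, 3, 3).
        kernel = torch.stack([sobel_x, sobel_y])
        kernel = kernel.repeat(channels, 1, 1).unsqueeze(1)
        self.register_buffer("kernel", kernel)
        self.channels = channels
        # Learned fusion of [features, gradients] back to `channels`.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 3, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise Sobel filtering: 2 gradient maps per input channel.
        grads = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        return self.fuse(torch.cat([x, grads], dim=1)) + x  # residual
```

Because the Sobel weights are registered as a buffer rather than parameters, the gradient extraction stays fixed during training and only the fusion convolution is learned, which is one common way to inject an explicit edge prior without extra trainable cost.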
Journal introduction:
Image and Vision Computing has as its primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.