Wenqian Chen, Wendie Yue, Kai Chang, Hongzhi Wang, Kaijun Tan, Xinyu Liu, Xiaoyi Cao
{"title":"基于动态特征增强和多模态对齐融合的遥感图像高效语义分割","authors":"Wenqian Chen , Wendie Yue , Kai Chang , Hongzhi Wang , Kaijun Tan , Xinyu Liu , Xiaoyi Cao","doi":"10.1016/j.neucom.2025.131555","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal fusion has shown promising applications in integrating information from different modalities. However, existing multimodal fusion approaches in remote sensing face two main challenges: First, multimodal fusion models relying on Convolutional Neural Networks (CNNs) or Visual Transformers (ViTs) have limitations in terms of remote modeling capabilities and computational complexity, while state-space model (SSM)-based fusion models are prone to feature redundancy due to the use of multiple scanning paths, and similarly suffer from high computational complexity. Second, existing methods do not fully address inter-modal heterogeneity, leading to poor multimodal data fusion. To address these issues, we propose an efficient multimodal fusion network, AFMamba, based on the state-space model (SSM) for semantic segmentation of remote sensing images. Specifically, we design the Efficient Dynamic Visual State Space (EDVSS) module, which enhances the efficiency of the standard Mamba model by dynamically improving local features and reducing channel redundancy. Furthermore, we introduce the Cross Attention Alignment Fusion (CAAFM) module, which combines cross-image attention fusion and channel interaction alignment to effectively improve the accuracy and efficiency of cross-modal feature fusion and mitigate feature inconsistency. Experimental results demonstrate that in multimodal hyperspectral image semantic segmentation, the proposed model reduces computational complexity, measured in GFLOPs, by at least 61 % while maintaining a low parameter count, achieving optimal overall accuracy (OA) of around 92 %, and effectively balancing performance and computational efficiency.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"657 ","pages":"Article 131555"},"PeriodicalIF":6.5000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient semantic segmentation of remote sensing images through dynamic feature enhancement and multimodal alignment fusion\",\"authors\":\"Wenqian Chen , Wendie Yue , Kai Chang , Hongzhi Wang , Kaijun Tan , Xinyu Liu , Xiaoyi Cao\",\"doi\":\"10.1016/j.neucom.2025.131555\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal fusion has shown promising applications in integrating information from different modalities. However, existing multimodal fusion approaches in remote sensing face two main challenges: First, multimodal fusion models relying on Convolutional Neural Networks (CNNs) or Visual Transformers (ViTs) have limitations in terms of remote modeling capabilities and computational complexity, while state-space model (SSM)-based fusion models are prone to feature redundancy due to the use of multiple scanning paths, and similarly suffer from high computational complexity. Second, existing methods do not fully address inter-modal heterogeneity, leading to poor multimodal data fusion. To address these issues, we propose an efficient multimodal fusion network, AFMamba, based on the state-space model (SSM) for semantic segmentation of remote sensing images. 
Specifically, we design the Efficient Dynamic Visual State Space (EDVSS) module, which enhances the efficiency of the standard Mamba model by dynamically improving local features and reducing channel redundancy. Furthermore, we introduce the Cross Attention Alignment Fusion (CAAFM) module, which combines cross-image attention fusion and channel interaction alignment to effectively improve the accuracy and efficiency of cross-modal feature fusion and mitigate feature inconsistency. Experimental results demonstrate that in multimodal hyperspectral image semantic segmentation, the proposed model reduces computational complexity, measured in GFLOPs, by at least 61 % while maintaining a low parameter count, achieving optimal overall accuracy (OA) of around 92 %, and effectively balancing performance and computational efficiency.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"657 \",\"pages\":\"Article 131555\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225022271\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225022271","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Efficient semantic segmentation of remote sensing images through dynamic feature enhancement and multimodal alignment fusion
Multimodal fusion has shown promise for integrating information from different modalities. However, existing multimodal fusion approaches in remote sensing face two main challenges. First, fusion models built on Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) are limited in long-range modeling capability or incur high computational complexity, while fusion models based on state-space models (SSMs) are prone to feature redundancy because they rely on multiple scanning paths, which likewise raises computational cost. Second, existing methods do not fully address inter-modal heterogeneity, leading to poor multimodal data fusion. To address these issues, we propose AFMamba, an efficient SSM-based multimodal fusion network for semantic segmentation of remote sensing images. Specifically, we design the Efficient Dynamic Visual State Space (EDVSS) module, which improves the efficiency of the standard Mamba model by dynamically enhancing local features and reducing channel redundancy. Furthermore, we introduce the Cross Attention Alignment Fusion (CAAFM) module, which combines cross-image attention fusion with channel interaction alignment to improve the accuracy and efficiency of cross-modal feature fusion and mitigate feature inconsistency. Experimental results on multimodal hyperspectral image semantic segmentation demonstrate that the proposed model reduces computational complexity (GFLOPs) by at least 61% while maintaining a low parameter count, achieves the best overall accuracy (OA) of around 92%, and effectively balances performance and computational efficiency.
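The abstract only outlines the two modules at a high level. As a rough illustration of the kind of fusion the CAAFM describes, the following is a minimal PyTorch sketch that combines cross-image attention (each modality attends to the other) with a channel-interaction gate. All class, variable, and shape choices here are assumptions made for illustration and are not taken from the paper.

# Minimal sketch of a cross-attention alignment fusion block in the spirit of
# the CAAFM described above. Module names, tensor shapes, and the exact
# attention/gating layout are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class CrossAttentionAlignmentFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Cross-image attention: each modality queries the other.
        self.attn_a2b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel interaction alignment: a squeeze-and-excite style gate over the
        # concatenated channels to reduce channel-wise mismatch between modalities.
        self.channel_gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, 2 * dim), nn.Sigmoid(),
        )
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, N, C) token sequences from two modality branches,
        # e.g. hyperspectral and an auxiliary modality flattened over space.
        a_enh, _ = self.attn_a2b(feat_a, feat_b, feat_b)   # modality A attends to B
        b_enh, _ = self.attn_b2a(feat_b, feat_a, feat_a)   # modality B attends to A
        fused = torch.cat([a_enh, b_enh], dim=-1)          # (B, N, 2C)
        gate = self.channel_gate(fused.mean(dim=1))        # (B, 2C) channel weights
        fused = fused * gate.unsqueeze(1)                  # channel re-weighting
        return self.proj(fused)                            # (B, N, C) fused features


if __name__ == "__main__":
    x_hsi = torch.randn(2, 64, 128)   # hypothetical hyperspectral tokens
    x_aux = torch.randn(2, 64, 128)   # hypothetical auxiliary-modality tokens
    out = CrossAttentionAlignmentFusion(128)(x_hsi, x_aux)
    print(out.shape)                  # torch.Size([2, 64, 128])

The symmetric attention pair keeps both modalities as queries, while the shared channel gate is one simple way to realize the "channel interaction alignment" idea; the paper's actual module may differ in structure and cost.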
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.