Zhan Zhang , Daoyu Shu , Cunyi Liao , Chengzhi Liu , Yuanxin Zhao , Ru Wang , Xiao Huang , Mi Zhang , Jianya Gong
{"title":"FlexiSAM:一个灵活的基于sam的语义分割模型,用于使用高分辨率多模态遥感图像进行土地覆盖分类","authors":"Zhan Zhang , Daoyu Shu , Cunyi Liao , Chengzhi Liu , Yuanxin Zhao , Ru Wang , Xiao Huang , Mi Zhang , Jianya Gong","doi":"10.1016/j.isprsjprs.2025.05.028","DOIUrl":null,"url":null,"abstract":"<div><div>Fine-grained land use and land cover (LULC) classification using high-resolution remote sensing (RS) imagery is fundamental to scientific research. Recently, the Segment Anything Model (SAM) has emerged as a major advance in deep learning-based LULC classification due to its robust segmentation and generalization capabilities. However, existing SAM-based models predominantly rely on single-modal inputs (e.g., optical RGB or SAR), limiting their ability to fully capture the complex spatial and spectral characteristics of RS imagery. Although multimodal RS data can provide complementary information to enhance classification accuracy, integrating multiple modalities into SAM presents significant challenges, including modality adaptation, semantic interference, and domain gaps. Building on this, we propose FlexiSAM, a SAM-based multimodal semantic segmentation model designed to overcome these challenges. FlexiSAM uses RGB as the primary modality while seamlessly integrating auxiliary RS modalities through a modular pipeline. Key innovations include the Dynamic Multimodal Feature Fusion Unit (DMMFU) and Dynamic Attention and the Context Aggregation Mixer (DACAM) for robust cross-modal feature fusion and refinement, and the Semantic Cross-Modal Integration Module (SCMII) for mitigating modality-induced feature misalignments and ensuring coherent multimodal integration. These are then processed by the adapted SAM encoder, enhanced with a lightweight adapter tailored for RS data, and followed by a dedicated decoder that produces precise classification outputs. Extensive experiments on the Korea, Houston2018, and Mini-FLAIR datasets, conducted using LuoJiaNET for core evaluations and PyTorch for cross-method comparisons, demonstrate FlexiSAM’s effectiveness and superiority, surpassing state-of-the-art models by at least 1.58% on Korea, 0.77% on Houston2018, and 1.14% in mIoU. Importantly, the LuoJiaNET framework delivers higher accuracy and efficiency compared to PyTorch. FlexiSAM also demonstrates strong adaptability and robustness across diverse RS modalities, establishing it as a versatile solution for fine-grained LULC classification.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"227 ","pages":"Pages 594-612"},"PeriodicalIF":12.2000,"publicationDate":"2025-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"FlexiSAM: A flexible SAM-based semantic segmentation model for land cover classification using high-resolution multimodal remote sensing imagery\",\"authors\":\"Zhan Zhang , Daoyu Shu , Cunyi Liao , Chengzhi Liu , Yuanxin Zhao , Ru Wang , Xiao Huang , Mi Zhang , Jianya Gong\",\"doi\":\"10.1016/j.isprsjprs.2025.05.028\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Fine-grained land use and land cover (LULC) classification using high-resolution remote sensing (RS) imagery is fundamental to scientific research. Recently, the Segment Anything Model (SAM) has emerged as a major advance in deep learning-based LULC classification due to its robust segmentation and generalization capabilities. 
However, existing SAM-based models predominantly rely on single-modal inputs (e.g., optical RGB or SAR), limiting their ability to fully capture the complex spatial and spectral characteristics of RS imagery. Although multimodal RS data can provide complementary information to enhance classification accuracy, integrating multiple modalities into SAM presents significant challenges, including modality adaptation, semantic interference, and domain gaps. Building on this, we propose FlexiSAM, a SAM-based multimodal semantic segmentation model designed to overcome these challenges. FlexiSAM uses RGB as the primary modality while seamlessly integrating auxiliary RS modalities through a modular pipeline. Key innovations include the Dynamic Multimodal Feature Fusion Unit (DMMFU) and Dynamic Attention and the Context Aggregation Mixer (DACAM) for robust cross-modal feature fusion and refinement, and the Semantic Cross-Modal Integration Module (SCMII) for mitigating modality-induced feature misalignments and ensuring coherent multimodal integration. These are then processed by the adapted SAM encoder, enhanced with a lightweight adapter tailored for RS data, and followed by a dedicated decoder that produces precise classification outputs. Extensive experiments on the Korea, Houston2018, and Mini-FLAIR datasets, conducted using LuoJiaNET for core evaluations and PyTorch for cross-method comparisons, demonstrate FlexiSAM’s effectiveness and superiority, surpassing state-of-the-art models by at least 1.58% on Korea, 0.77% on Houston2018, and 1.14% in mIoU. Importantly, the LuoJiaNET framework delivers higher accuracy and efficiency compared to PyTorch. FlexiSAM also demonstrates strong adaptability and robustness across diverse RS modalities, establishing it as a versatile solution for fine-grained LULC classification.</div></div>\",\"PeriodicalId\":50269,\"journal\":{\"name\":\"ISPRS Journal of Photogrammetry and Remote Sensing\",\"volume\":\"227 \",\"pages\":\"Pages 594-612\"},\"PeriodicalIF\":12.2000,\"publicationDate\":\"2025-06-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ISPRS Journal of Photogrammetry and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0924271625002151\",\"RegionNum\":1,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"GEOGRAPHY, PHYSICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ISPRS Journal of Photogrammetry and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0924271625002151","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GEOGRAPHY, PHYSICAL","Score":null,"Total":0}
FlexiSAM: A flexible SAM-based semantic segmentation model for land cover classification using high-resolution multimodal remote sensing imagery
Fine-grained land use and land cover (LULC) classification using high-resolution remote sensing (RS) imagery is fundamental to scientific research. Recently, the Segment Anything Model (SAM) has emerged as a major advance in deep learning-based LULC classification due to its robust segmentation and generalization capabilities. However, existing SAM-based models predominantly rely on single-modal inputs (e.g., optical RGB or SAR), limiting their ability to fully capture the complex spatial and spectral characteristics of RS imagery. Although multimodal RS data can provide complementary information to enhance classification accuracy, integrating multiple modalities into SAM presents significant challenges, including modality adaptation, semantic interference, and domain gaps. To address these challenges, we propose FlexiSAM, a SAM-based multimodal semantic segmentation model. FlexiSAM uses RGB as the primary modality while seamlessly integrating auxiliary RS modalities through a modular pipeline. Key innovations include the Dynamic Multimodal Feature Fusion Unit (DMMFU) and the Dynamic Attention and Context Aggregation Mixer (DACAM) for robust cross-modal feature fusion and refinement, and the Semantic Cross-Modal Integration Module (SCMII) for mitigating modality-induced feature misalignments and ensuring coherent multimodal integration. The fused features are then processed by the adapted SAM encoder, enhanced with a lightweight adapter tailored for RS data, and passed to a dedicated decoder that produces precise classification outputs. Extensive experiments on the Korea, Houston2018, and Mini-FLAIR datasets, conducted using LuoJiaNET for core evaluations and PyTorch for cross-method comparisons, demonstrate FlexiSAM's effectiveness and superiority: it surpasses state-of-the-art models in mIoU by at least 1.58% on Korea, 0.77% on Houston2018, and 1.14% on Mini-FLAIR. Notably, the LuoJiaNET framework delivers higher accuracy and efficiency than PyTorch. FlexiSAM also demonstrates strong adaptability and robustness across diverse RS modalities, establishing it as a versatile solution for fine-grained LULC classification.
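To make the modular pipeline described in the abstract more concrete, the sketch below shows, under loose assumptions, how an RGB-primary model with one auxiliary modality could be wired in PyTorch: a gated fusion block standing in for a DMMFU-style unit, a frozen backbone with a lightweight residual adapter in place of the adapted SAM encoder, and a small decoder head. All class names, layer choices, and hyperparameters are illustrative placeholders, not the authors' implementation.

# Minimal sketch of an RGB-primary multimodal segmentation pipeline (not the authors' code).
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Gated cross-modal fusion (stand-in for a DMMFU-style unit)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([rgb_feat, aux_feat], dim=1)
        g = self.gate(stacked)                        # per-pixel modality weighting
        return g * rgb_feat + (1.0 - g) * self.proj(stacked)


class ResidualAdapter(nn.Module):
    """Bottleneck adapter added beside a frozen backbone (generic adapter pattern)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Conv2d(channels, channels // reduction, 1)
        self.up = nn.Conv2d(channels // reduction, channels, 1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))    # small trainable residual update


class MultimodalSegmenter(nn.Module):
    def __init__(self, num_classes: int, aux_bands: int = 1, channels: int = 64):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, channels, 3, padding=1)
        self.aux_stem = nn.Conv2d(aux_bands, channels, 3, padding=1)  # e.g. SAR or nDSM
        self.fusion = GatedFusion(channels)
        # Placeholder backbone standing in for a pretrained, frozen image encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
        )
        for p in self.backbone.parameters():
            p.requires_grad = False                   # only adapter and decoder are trained
        self.adapter = ResidualAdapter(channels)
        self.decoder = nn.Conv2d(channels, num_classes, 1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        fused = self.fusion(self.rgb_stem(rgb), self.aux_stem(aux))
        feats = self.adapter(self.backbone(fused))
        return self.decoder(feats)                    # per-pixel class logits


if __name__ == "__main__":
    model = MultimodalSegmenter(num_classes=8, aux_bands=1)
    rgb = torch.randn(2, 3, 128, 128)                 # RGB patch
    aux = torch.randn(2, 1, 128, 128)                 # co-registered auxiliary band
    print(model(rgb, aux).shape)                      # torch.Size([2, 8, 128, 128])

The frozen-backbone-plus-adapter pattern mirrors the abstract's idea of adapting a pretrained segmentation encoder to RS data without full fine-tuning; the real FlexiSAM modules (DMMFU, DACAM, SCMII) are more elaborate than this gated fusion stand-in.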
Journal Introduction:
The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive.
P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields.
In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.