{"title":"A Unified Framework With Multimodal Fine-Tuning for Remote Sensing Semantic Segmentation","authors":"Xianping Ma;Xiaokang Zhang;Man-On Pun;Bo Huang","doi":"10.1109/TGRS.2025.3585238","DOIUrl":null,"url":null,"abstract":"Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth’s surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single-modality approaches. Building on advancements in vision foundation models, particularly the segment anything model (SAM), this study proposes a unified framework incorporating a novel multimodal fine-tuning network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine-tuning mechanisms, demonstrated through the inclusion of Adapter and low-rank adaptation (LoRA) as representative examples. This extensibility ensures the framework’s adaptability to other emerging fine-tuning strategies, allowing models to retain SAM’s general knowledge while effectively leveraging multimodal data. Additionally, a pyramid-based deep fusion module (DFM) is introduced to integrate high-level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM’s robust generalization capabilities with digital surface model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam, and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at <uri>https://github.com/sstary/SSRS</uri>.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-15"},"PeriodicalIF":8.6000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11063320/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth’s surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single-modality approaches. Building on advancements in vision foundation models, particularly the segment anything model (SAM), this study proposes a unified framework incorporating a novel multimodal fine-tuning network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine-tuning mechanisms, demonstrated through the inclusion of Adapter and low-rank adaptation (LoRA) as representative examples. This extensibility ensures the framework’s adaptability to other emerging fine-tuning strategies, allowing models to retain SAM’s general knowledge while effectively leveraging multimodal data. Additionally, a pyramid-based deep fusion module (DFM) is introduced to integrate high-level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM’s robust generalization capabilities with digital surface model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam, and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at https://github.com/sstary/SSRS.
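To make the fine-tuning idea concrete, below is a minimal PyTorch sketch of LoRA-style adaptation as the abstract describes it in principle: the pretrained weights (e.g., the projections inside SAM's image encoder) stay frozen, and only a small low-rank residual is trained. `LoRALinear` and its hyperparameters `r` and `alpha` are illustrative assumptions, not the paper's actual implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper (not from the paper's repository).

    Computes y = W x + (alpha / r) * B(A(x)), where W is a frozen
    pretrained linear layer and only the low-rank factors A and B train.
    """

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained (SAM) weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so the
        # wrapped layer initially behaves exactly like the pretrained one
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical usage: wrap the qkv projection of one ViT block
# (dimensions here are typical of a ViT-B encoder, chosen for illustration).
qkv = LoRALinear(nn.Linear(768, 768 * 3), r=4, alpha=8.0)
x = torch.randn(1, 196, 768)   # (batch, tokens, embedding dim)
print(qkv(x).shape)            # torch.Size([1, 196, 2304])
```

Zero-initializing the second factor means fine-tuning begins from the unmodified pretrained behavior, which is one standard way to realize the abstract's goal of retaining SAM's general knowledge while adapting to multimodal data.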
Journal description:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.