A Unified Framework With Multimodal Fine-Tuning for Remote Sensing Semantic Segmentation

IF 8.6 · CAS Tier 1 (Earth Science) · JCR Q1 (Engineering, Electrical & Electronic)
Xianping Ma; Xiaokang Zhang; Man-On Pun; Bo Huang
{"title":"A Unified Framework With Multimodal Fine-Tuning for Remote Sensing Semantic Segmentation","authors":"Xianping Ma;Xiaokang Zhang;Man-On Pun;Bo Huang","doi":"10.1109/TGRS.2025.3585238","DOIUrl":null,"url":null,"abstract":"Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth’s surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single-modality approaches. Building on advancements in vision foundation models, particularly the segment anything model (SAM), this study proposes a unified framework incorporating a novel multimodal fine-tuning network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine-tuning mechanisms, demonstrated through the inclusion of Adapter and low-rank adaptation (LoRA) as representative examples. This extensibility ensures the framework’s adaptability to other emerging fine-tuning strategies, allowing models to retain SAM’s general knowledge while effectively leveraging multimodal data. Additionally, a pyramid-based deep fusion module (DFM) is introduced to integrate high-level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM’s robust generalization capabilities with digital surface model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam, and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at <uri>https://github.com/sstary/SSRS</uri>.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"63 ","pages":"1-15"},"PeriodicalIF":8.6000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11063320/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth’s surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single-modality approaches. Building on advancements in vision foundation models, particularly the segment anything model (SAM), this study proposes a unified framework incorporating a novel multimodal fine-tuning network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine-tuning mechanisms, demonstrated through the inclusion of Adapter and low-rank adaptation (LoRA) as representative examples. This extensibility ensures the framework’s adaptability to other emerging fine-tuning strategies, allowing models to retain SAM’s general knowledge while effectively leveraging multimodal data. Additionally, a pyramid-based deep fusion module (DFM) is introduced to integrate high-level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM’s robust generalization capabilities with digital surface model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam, and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at https://github.com/sstary/SSRS.
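The abstract names two concrete mechanisms: parameter-efficient fine-tuning of the frozen SAM backbone (with Adapter and LoRA as representative examples) and a pyramid-based deep fusion module (DFM) that merges multimodal features across scales before decoding. As an illustration of the first, here is a minimal PyTorch sketch of the LoRA idea applied to a generic linear projection inside a frozen backbone; the class and parameter names (`LoRALinear`, `lora_a`, `lora_b`) are invented for this sketch and are not the authors' MFNet code:

```python
# Minimal, hypothetical sketch of LoRA-style fine-tuning as the abstract
# describes it: the frozen pretrained weights keep SAM's general knowledge,
# while small low-rank matrices learn the multimodal adaptation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank residual path
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

For the second mechanism, the following is a rough sketch of what a pyramid-style fusion of two modality streams (e.g., optical imagery features and DSM features) could look like; the authors' actual DFM design should be taken from the paper and the linked repository:

```python
class PyramidFusion(nn.Module):
    """Fuses two modality feature maps at several scales, then merges.

    Hypothetical stand-in for the paper's pyramid-based DFM: each scale
    pools both streams, mixes them with a 1x1 convolution, and upsamples
    back before a final 1x1 convolution combines all scales.
    """
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.mix = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in scales
        )
        self.out = nn.Conv2d(len(scales) * channels, channels, kernel_size=1)

    def forward(self, optical: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
        fused = []
        for s, mix in zip(self.scales, self.mix):
            # pool both modalities to the current scale, mix, upsample back
            o = nn.functional.avg_pool2d(optical, s) if s > 1 else optical
            d = nn.functional.avg_pool2d(dsm, s) if s > 1 else dsm
            f = mix(torch.cat([o, d], dim=1))
            if s > 1:
                f = nn.functional.interpolate(
                    f, size=optical.shape[-2:], mode="bilinear",
                    align_corners=False)
            fused.append(f)
        return self.out(torch.cat(fused, dim=1))

# Usage sketch: fuse two 256-channel feature maps of the same spatial size.
# fusion = PyramidFusion(256)
# y = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```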
Source Journal

IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology: Geochemistry & Geophysics)
CiteScore: 11.50
Self-citation rate: 28.00%
Annual publications: 1912
Review time: 4.0 months
Journal description: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.