FFSNet: Adaptive features fusion of foundation models and self-supervised models for remote sensing image segmentation

Impact factor: 3.0 · Zone 3 (Engineering and Technology) · JCR Q2 (Engineering, Electrical & Electronic)

Digital Signal Processing · Pub Date: 2025-10-03 · DOI: 10.1016/j.dsp.2025.105634

Dunyou Liang, Feng Peng, Bing Wu, Xiaojun Cui, Haolin Zhuang, Guoyu Zhang

Volume 168, Article 105634. Full text: https://www.sciencedirect.com/science/article/pii/S1051200425006566

Citations: 0

Abstract

Remote sensing image segmentation is essential for urban planning, environmental monitoring, and disaster assessment, but it is hampered by scarce pixel-level annotations, domain shifts, and the difficulty of separating spectrally similar land cover classes. Existing methods address these issues only in part. Supervised approaches such as UNetFormer require extensive labeled data and generalize poorly. Foundation models such as SAM enable zero-shot segmentation, but their high inference overhead limits practical use in remote sensing. Self-supervised models such as DINO capture domain-specific features but lack the global priors and generalization capabilities of large-scale foundation models, which reduces their effectiveness in complex remote sensing scenes. To overcome these limitations, FFSNet is proposed as a framework that integrates a lightweight MobileSAM encoder with a DINOv2 self-supervised encoder pretrained on remote sensing data. Its core component, the adaptive feature fusion module, balances general visual priors against domain-specific representations using attention-based dynamic weighting. In addition, a modified category mask decoder extends the binary mask output to multi-class segmentation using learnable prototype vectors. Experiments on three benchmark datasets validate the effectiveness of FFSNet. It achieves an mIoU of 55.4% on LoveDA, surpassing D2lS; an mF1 of 88.3% on ISPRS Potsdam, outperforming AerialFormer; and an mF1 of 91.6% on Vaihingen, while using only 44.7 M parameters, a 50% reduction compared to D2lS. FFSNet thus offers an efficient route to domain adaptation for foundation models, combining higher segmentation accuracy with lower computational cost, which makes it practical for large-scale remote sensing applications.
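
The abstract describes the fusion module and the prototype-based decoder only at a high level. The following is a minimal sketch of the general idea, not the authors' implementation: two per-pixel feature vectors (one standing in for the MobileSAM branch, one for the DINOv2 branch) are blended with softmax weights computed from the features themselves, and the fused vector is assigned to the class whose prototype it matches best. The function names, the gate vectors, and the dot-product scoring are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(general_feat, domain_feat, gate_general, gate_domain):
    """Blend two feature vectors with input-dependent weights.

    gate_general and gate_domain are hypothetical learnable vectors that
    score each branch; the softmax turns the two scores into weights that
    sum to 1, so the mix changes from pixel to pixel.
    """
    w_general, w_domain = softmax(
        [dot(general_feat, gate_general), dot(domain_feat, gate_domain)]
    )
    fused = [
        w_general * g + w_domain * d
        for g, d in zip(general_feat, domain_feat)
    ]
    return fused, (w_general, w_domain)


def classify(fused_feat, prototypes):
    """Return the index of the class prototype most similar to the feature.

    Similarity is a plain dot product here (an assumption); one prototype
    vector per class turns a single feature into a multi-class decision.
    """
    logits = [dot(fused_feat, p) for p in prototypes]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy usage: one pixel, three feature channels, three classes.
general = [1.0, 0.0, 0.5]   # stand-in for a foundation-model feature
domain = [0.0, 1.0, 0.5]    # stand-in for a self-supervised feature
fused, weights = fuse(general, domain, [0.0, 0.0, 0.0], [0.0, 10.0, 0.0])
label = classify(fused, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, -1.0]])
print(weights, label)
```

In this toy setup the second gate vector responds strongly to the domain feature, so the domain branch dominates the mix; with all-zero gates the two branches would be averaged equally. A real module would learn the gates and prototypes and operate on full feature maps.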

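Source journal

Digital Signal Processing (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days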

About the journal: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy,