Title: FFSNet: Adaptive features fusion of foundation models and self-supervised models for remote sensing image segmentation
Authors: Dunyou Liang, Feng Peng, Bing Wu, Xiaojun Cui, Haolin Zhuang, Guoyu Zhang
Journal: Digital Signal Processing, volume 168, Article 105634
Publication date: 2025-10-03
Publication type: Journal Article
DOI: 10.1016/j.dsp.2025.105634
URL: https://www.sciencedirect.com/science/article/pii/S1051200425006566
Impact factor: 3.0 (JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC)
Category: Engineering Technology (region 3)
Open access: no
Citations: 0
Abstract
Remote sensing image segmentation is essential for urban planning, environmental monitoring, and disaster assessment, but it is challenged by scarce pixel-level annotations, domain shifts, and the difficulty of segmenting spectrally similar land cover classes. Existing methods struggle to address these issues comprehensively. Supervised approaches like UNetFormer require extensive labeled data and have limited generalization. Foundation models like SAM enable zero-shot segmentation but are constrained by high inference overhead, limiting their practical use in remote sensing. Self-supervised models like DINO capture domain-specific features but lack the global priors and generalization capabilities of large-scale foundation models, reducing their effectiveness in complex remote sensing scenarios. To overcome these limitations, FFSNet is proposed as a novel framework that integrates a lightweight MobileSAM encoder with a DINOv2 self-supervised encoder pretrained on remote sensing data. Its core innovation, the adaptive feature fusion module, balances general visual priors and domain-specific representations using attention-based dynamic weighting. Additionally, a modified category mask decoder extends binary output to multi-class segmentation using learnable prototype vectors. Experiments on three benchmark datasets validate the effectiveness of FFSNet. It achieves an mIoU of 55.4% on LoveDA (surpassing D2lS), an mF1 of 88.3% on ISPRS Potsdam (outperforming AerialFormer), and an mF1 of 91.6% on Vaihingen, while using only 44.7 M parameters, a 50% reduction compared to D2lS. FFSNet establishes a new paradigm for efficient domain adaptation in foundation models, offering superior segmentation accuracy with reduced computational costs, making it highly practical for large-scale remote sensing applications.
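The abstract names two mechanisms without giving their equations: attention-based dynamic weighting of two encoder branches, and multi-class prediction from learnable prototype vectors. The sketch below is only a toy illustration of those two ideas for a single pixel, in plain Python. The names `fuse`, `classify`, and `gate_vector`, the dot-product scoring, and all numeric values are assumptions made for this example; they are not taken from the paper and do not reproduce FFSNet's actual modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(general_feat, domain_feat, gate_vector):
    """Blend two per-pixel feature vectors with softmax weights.

    Assumption: each branch is scored by its dot product with a
    hypothetical learned gate_vector; the paper's fusion module may
    compute its weights differently.
    """
    weights = softmax([dot(general_feat, gate_vector),
                       dot(domain_feat, gate_vector)])
    fused = [weights[0] * g + weights[1] * d
             for g, d in zip(general_feat, domain_feat)]
    return fused, weights


def classify(fused_feat, prototypes):
    """Return the index of the prototype most similar to the feature."""
    logits = [dot(fused_feat, p) for p in prototypes]
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, logits


# Toy inputs (made up): one pixel, 3-dimensional features, 3 classes.
general_feat = [1.0, 0.0, 0.5]   # stands in for the MobileSAM branch
domain_feat = [0.0, 1.0, 0.5]    # stands in for the DINOv2 branch
gate_vector = [0.2, 0.8, 0.0]    # hypothetical learned parameter
prototypes = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]   # one hypothetical prototype per class

fused, weights = fuse(general_feat, domain_feat, gate_vector)
label, logits = classify(fused, prototypes)
print("branch weights:", weights)
print("fused feature:", fused)
print("predicted class index:", label)
```

With these toy values the domain branch receives the larger weight, so the fused feature leans toward it and class index 1 is selected. In a real model the gate and prototypes would be trained, and the same computation would run over every pixel of a feature map rather than a single vector.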
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy