Domain generalization for semantic segmentation of remote sensing images via vision foundation model fine-tuning

IF 12.2; Zone 1 (Earth Sciences); Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing. Pub Date: 2025-09-17. DOI: 10.1016/j.isprsjprs.2025.09.004

Muying Luo, Yujie Zan, Kourosh Khoshelham, Shunping Ji

Volume 230, Pages 126-146. Article page: https://www.sciencedirect.com/science/article/pii/S0924271625003569

Citations: 0

Abstract

Practice-oriented, general-purpose deep semantic segmentation models must be effective across diverse application scenarios without heavy re-training, or with only minimal fine-tuning. This calls for models with strong domain generalization ability. Vision Foundation Models (VFMs), trained on massive and diverse datasets, have shown impressive generalization capabilities in computer vision tasks. However, how to exploit their generalization ability for cross-domain semantic segmentation of remote sensing images remains understudied. In this paper, we seek to identify the most suitable VFM for remote sensing images and to further enhance its generalization ability for remote sensing image segmentation. Our study begins with a comprehensive evaluation of the generalization ability of various VFMs and classic CNN or transformer backbone networks under different settings. We find that DINO v2 ViT-L outperforms the other backbones, both with frozen parameters and with full fine-tuning. Building upon DINO v2, we propose a novel domain generalization framework from both the data and the deep feature perspectives. This framework incorporates two key modules, the Geospatial Semantic Adapter (GeoSA) and the Batch Style Augmenter (BaSA), which together unlock the potential of DINO v2 in remote sensing image semantic segmentation. GeoSA consists of three core components: an enhancer, a bridge, and an extractor. These components work together to extract robust features from the pre-trained DINO v2 and to generate multi-scale features adapted to remote sensing images. BaSA employs batch-level data augmentation to reduce reliance on dataset-specific features and promote domain-invariant learning. Extensive experiments across four remote sensing datasets and four domain generalization scenarios, covering both binary and multi-class semantic segmentation, consistently demonstrate our method's superior cross-domain generalization ability and robustness, surpassing advanced domain generalization methods and other VFM fine-tuning methods. Code will be released at https://github.com/mmmll23/GeoSA-BaSA.
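The abstract does not describe how BaSA works internally, so the following is only a rough illustration of what batch-level style augmentation can look like, not the authors' module. It swaps in a generic statistics-mixing idea (similar in spirit to MixStyle): each image keeps its normalized content but receives per-channel mean and standard deviation interpolated with those of another image in the same batch. The function names, the `lam` mixing weight, and the nested-list image layout are all assumptions made for this sketch.

```python
import random
from statistics import fmean, pstdev


def channel_stats(channel, eps=1e-6):
    """Return (mean, std) of one flattened channel; eps keeps std non-zero."""
    return fmean(channel), pstdev(channel) + eps


def batch_style_mix(batch, lam=0.5, rng=None):
    """Generic batch-level style mixing (hypothetical, NOT the paper's BaSA).

    batch: list of images; each image is a list of channels; each channel
    is a flat list of floats. lam=1.0 leaves every image unchanged, while
    lam=0.0 gives each image the channel statistics of its partner.
    """
    rng = rng or random.Random(0)
    # Pair every image with a partner drawn from the same batch.
    perm = list(range(len(batch)))
    rng.shuffle(perm)

    mixed = []
    for i, image in enumerate(batch):
        partner = batch[perm[i]]
        new_image = []
        for ch, partner_ch in zip(image, partner):
            mu, sigma = channel_stats(ch)
            mu_p, sigma_p = channel_stats(partner_ch)
            # Interpolate the "style" (first- and second-order statistics).
            mu_mix = lam * mu + (1 - lam) * mu_p
            sigma_mix = lam * sigma + (1 - lam) * sigma_p
            # Normalize the content, then re-apply the mixed statistics.
            new_image.append([(x - mu) / sigma * sigma_mix + mu_mix for x in ch])
        mixed.append(new_image)
    return mixed
```

In a training loop, a function like this would be applied to each mini-batch before the forward pass, so the model sees the same scene content under varying appearance statistics. Whether BaSA operates on pixels or features, and how it selects or mixes styles, is not stated in the abstract.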

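Source journal

ISPRS Journal of Photogrammetry and Remote Sensing (Engineering and Technology: Imaging Science and Photographic Technology)

CiteScore: 21.00

Self-citation rate: 6.30%

Articles published: 273

Review time: 40 days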

About the journal: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.