Saliency supervised masked autoencoder pretrained salient location mining network for remote sensing image salient object detection

IF 10.6 | Zone 1 (Earth Sciences) | Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing | Pub Date: 2025-04-12 | DOI: 10.1016/j.isprsjprs.2025.03.025

Yuxiang Fu, Wei Fang, Victor S. Sheng

Volume 224, Pages 222-234. Full text: https://www.sciencedirect.com/science/article/pii/S0924271625001236

Citations: 0

Abstract

Remote sensing image salient object detection (RSI-SOD), an emerging topic in computer vision, has significant applications in sectors such as urban planning, environmental monitoring, and disaster management. In recent years, RSI-SOD has advanced considerably, largely owing to better representation learning methods and architectures such as convolutional neural networks and vision transformers. While current methods predominantly rely on supervised learning, self-supervised approaches such as the masked autoencoder offer room for improvement. However, we observed that the conventional use of a masked autoencoder to pretrain encoders through masked image reconstruction yields subpar results for RSI-SOD. To this end, we propose a saliency supervised masked autoencoder (SSMAE) and a corresponding salient location mining network (SLMNet), which is pretrained by SSMAE for the RSI-SOD task. SSMAE first uses a masked autoencoder to reconstruct the masked image and then employs SLMNet to predict a saliency map from the reconstructed image, with saliency supervision enabling SLMNet to learn robust saliency prior knowledge. SLMNet has three major components: an encoder, a salient location mining module (SLMM), and a decoder. Specifically, SLMM employs a residual multi-level fusion structure to mine the locations of salient objects from the multi-scale features produced by the encoder. The decoder then fuses the multi-level features from SLMM and the encoder to generate the prediction results. Comprehensive experiments on three public datasets demonstrate that the proposed method surpasses state-of-the-art methods. Code is available at: https://github.com/Voruarn/SLMNet.
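The data flow described in the abstract (mask the image, reconstruct it, predict saliency from the reconstruction, supervise with saliency labels) can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch and not the authors' implementation: the 0.75 default mask ratio, the two separate loss terms, the scalar-per-patch representation, and every function name (`random_patch_mask`, `ssmae_step`, `fill_with_mean`, `toy_saliency`) are assumptions made for the example. The real encoder, SLMM, and decoder live in the linked repository.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return booleans; True marks a patch hidden from the encoder (assumed ratio)."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(recon, target, mask):
    """Mean squared error over masked patches only (one scalar per patch here)."""
    errs = [(r - t) ** 2 for r, t, m in zip(recon, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


def binary_cross_entropy(pred, gt, eps=1e-7):
    """Mean BCE between predicted saliency probabilities and binary labels."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(pred)


def ssmae_step(patches, saliency_gt, reconstruct, predict_saliency,
               mask_ratio=0.75, seed=0):
    """One hypothetical pretraining step: mask, reconstruct, predict saliency.

    `reconstruct` stands in for the masked autoencoder and `predict_saliency`
    for SLMNet; both are placeholders supplied by the caller.
    """
    mask = random_patch_mask(len(patches), mask_ratio, seed)
    visible = [None if m else p for p, m in zip(patches, mask)]
    recon = reconstruct(visible)          # stage 1: masked image reconstruction
    saliency = predict_saliency(recon)    # stage 2: saliency from reconstruction
    return masked_mse(recon, patches, mask), binary_cross_entropy(saliency, saliency_gt)


def fill_with_mean(visible):
    """Toy 'autoencoder': fill hidden patches with the mean of visible ones."""
    seen = [v for v in visible if v is not None]
    mean = sum(seen) / len(seen) if seen else 0.0
    return [mean if v is None else v for v in visible]


def toy_saliency(recon):
    """Toy 'SLMNet': a sigmoid threshold on patch intensity."""
    return [1.0 / (1.0 + math.exp(-10.0 * (x - 0.35))) for x in recon]


if __name__ == "__main__":
    patches = [0.1 * i for i in range(8)]
    labels = [1 if p > 0.35 else 0 for p in patches]
    rec_loss, sal_loss = ssmae_step(patches, labels, fill_with_mean, toy_saliency,
                                    mask_ratio=0.5)
    print("reconstruction loss:", rec_loss, "saliency loss:", sal_loss)
```

The sketch returns the two losses separately because the abstract does not state how, or whether, they are combined during pretraining.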




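Source journal

ISPRS Journal of Photogrammetry and Remote Sensing (Engineering & Technology: Imaging Science & Photographic Technology)

CiteScore: 21.00
Self-citation rate: 6.30%
Articles published: 273
Review time: 40 days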

About the journal: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.