Toward Multitask Perception for Remote Sensing Imagery via Compression and Prompt Tuning

IEEE Geoscience and Remote Sensing Letters (a publication of the IEEE Geoscience and Remote Sensing Society), vol. 22, pp. 1-5. Pub Date: 2025-07-14. DOI: 10.1109/LGRS.2025.3589030

Yongqiang Wang, Feng Liang, Hang Chen, Haisheng Fu, Jiro Katto

Abstract

Advances in satellite technology have greatly increased the availability of high-resolution remote sensing images. Concurrently, learning-based image compression (LIC) has significantly improved the efficiency of transmitting and storing such images. As machine recognition tasks increasingly depend on transmitting visual data across devices, compressed images play a key role in both human and machine perception in downstream tasks. However, most LIC approaches are not optimized for machine recognition tasks. To address this limitation, we propose a remote sensing image compression network called RSIC, which integrates multitask perception and supports downstream tasks such as object detection. Specifically, we introduce a wavelet-based frequency-spatial block (WFSB) that separates frequency components and processes them with transformer and convolutional neural network (CNN) blocks to capture frequency-specific features. Within the WFSB, the prompting Swin-Transformer block (PSTB) extracts spatial information while enabling prompt tuning. In addition, after primary codec training, instance prompts and task prompts are applied during the encoding and decoding stages, respectively, facilitating machine perception without full fine-tuning. Extensive experiments show that our model achieves better rate–distortion (R–D) performance for image compression on the aerial image dataset (AID) test set, surpassing the traditional versatile video coding (VVC) codec and several recent LIC methods. Furthermore, our method demonstrates superior rate–accuracy performance for machine perception on two remote sensing datasets: the Northwestern Polytechnical University Very-High-Resolution 10-Class Dataset (NWPU VHR-10) and the High-Resolution SAR Images Dataset (HRSID).
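
To make the idea of separating frequency components concrete, the following is a minimal sketch of a single-level 2D wavelet decomposition. It is not the WFSB itself: the abstract does not state which wavelet is used, so the Haar wavelet here is an assumption, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. The sketch only shows how an image splits into one low-frequency subband and three high-frequency subbands that could then be routed to different processing branches.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform (assumed wavelet; even height and width required)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right pixel
    c = x[1::2, 0::2]  # bottom-left pixel
    d = x[1::2, 1::2]  # bottom-right pixel
    ll = (a + b + c + d) / 2.0  # low-pass in both directions (coarse content)
    lh = (a + b - c - d) / 2.0  # difference between the two rows of each block
    hl = (a - b + c - d) / 2.0  # difference between the two columns of each block
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution array from the four subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))  # stand-in for one image channel
    subbands = haar_dwt2(image)
    print([s.shape for s in subbands])
    print(np.allclose(haar_idwt2(*subbands), image))
```

Each subband has half the height and width of the input, and the transform is invertible, so no information is lost by the split. Subband naming conventions (LH versus HL) differ between libraries; the labels above are one common choice.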
