Toward Multitask Perception for Remote Sensing Imagery via Compression and Prompt Tuning
Authors: Yongqiang Wang; Feng Liang; Hang Chen; Haisheng Fu; Jiro Katto
Journal: IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1-5
Publication date: 2025-07-14
DOI: 10.1109/LGRS.2025.3589030
URL: https://ieeexplore.ieee.org/document/11080011/
Citations: 0
Abstract
Recent advances in satellite technology have greatly increased the availability of high-resolution remote sensing images. Concurrently, learning-based image compression (LIC) has significantly improved the efficiency of transmitting and storing such images. As machine recognition tasks increasingly depend on visual data transmitted across devices, compressed images serve both human viewing and machine perception in downstream tasks. However, most LIC approaches are not optimized for machine recognition. To address this limitation, we propose a remote sensing image compression network called RSIC, which integrates multitask perception and supports downstream tasks such as object detection. Specifically, we introduce a wavelet-based frequency-spatial block (WFSB) that separates frequency components and processes them with transformer and convolutional neural network (CNN) blocks to capture frequency-specific features. Within the WFSB, the prompting Swin-Transformer block (PSTB) extracts spatial information while enabling prompt tuning. In addition, after the primary codec has been trained, instance prompts are applied during encoding and task prompts during decoding, which facilitates machine perception without full fine-tuning. Extensive experiments show that our model achieves better rate–distortion (R–D) performance for image compression on the Aerial Image Dataset (AID) test set than the traditional versatile video coding (VVC) codec and several recent LIC methods. Furthermore, our method achieves superior rate–accuracy performance for machine perception on two remote sensing datasets: the Northwestern Polytechnical University Very-High-Resolution 10-Class Dataset (NWPU VHR-10) and the High-Resolution SAR Images Dataset (HRSID).
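To make the idea of separating frequency components concrete, the following is a minimal sketch of a single-level 2-D wavelet decomposition. It is not the authors' WFSB: the abstract does not say which wavelet is used or how the resulting subbands are routed to the transformer and CNN blocks. The sketch assumes a Haar wavelet, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, chosen only for illustration.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of a list-of-rows image (even height and width).

    Assumption: Haar is used only for illustration; the paper's wavelet is not
    stated in the abstract.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    low, row_detail, col_detail, diag_detail = [], [], [], []
    for i in range(0, height, 2):
        l_row, r_row, c_row, d_row = [], [], [], []
        for j in range(0, width, 2):
            a, b = image[i][j], image[i][j + 1]          # top pair of the 2x2 block
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom pair of the 2x2 block
            l_row.append((a + b + c + d) / 2)  # scaled local average: low frequency
            r_row.append((a + b - c - d) / 2)  # top row minus bottom row
            c_row.append((a - b + c - d) / 2)  # left column minus right column
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        low.append(l_row)
        row_detail.append(r_row)
        col_detail.append(c_row)
        diag_detail.append(d_row)
    return low, row_detail, col_detail, diag_detail


def haar_idwt2(low, row_detail, col_detail, diag_detail):
    """Invert haar_dwt2, rebuilding the full-resolution image from the four subbands."""
    image = []
    for l_row, r_row, c_row, d_row in zip(low, row_detail, col_detail, diag_detail):
        top, bottom = [], []
        for l, r, c, d in zip(l_row, r_row, c_row, d_row):
            top += [(l + r + c + d) / 2, (l + r - c - d) / 2]
            bottom += [(l - r + c - d) / 2, (l - r - c + d) / 2]
        image.append(top)
        image.append(bottom)
    return image


# A 4x4 intensity ramp: smooth content, so the diagonal detail band is all zeros.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
low, row_detail, col_detail, diag_detail = haar_dwt2(image)
restored = haar_idwt2(low, row_detail, col_detail, diag_detail)
```

Each subband has half the height and width of the input, and the transform is exactly invertible, so splitting an image or feature map this way loses no information before the separate branches process the low- and high-frequency parts.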