Context-driven and sparse decoding for Remote Sensing Visual Grounding
Yichen Zhao, Yaxiong Chen, Ruilin Yao, Shengwu Xiong, Xiaoqiang Lu
Information Fusion, Volume 123, Article 103296 (published 2025-05-22). DOI: 10.1016/j.inffus.2025.103296
Citations: 0
Abstract
Remote Sensing Visual Grounding (RSVG) is an emerging multimodal remote sensing task that grounds textual descriptions to specific objects in remote sensing images. Previous methods often overlook the impact of complex backgrounds and similar geographic entities during feature extraction, which can confuse target features and create performance bottlenecks. Moreover, remote sensing scenes contain extensive surface information, much of which contributes little to reasoning about the target object. This redundancy not only increases the computational burden but also impairs decoding efficiency. To this end, we propose the Context-driven Sparse Decoding Network (CSDNet) for accurate grounding through multimodal context-aware feature extraction and text-guided sparse reasoning. To alleviate target feature confusion, a Text-aware Fusion Module (TFM) is introduced to refine the visual features using textual cues related to the image context. In addition, a Context-enhanced Interaction Module (CIM) is proposed to harmonize the differences between remote sensing images and text by modeling multimodal contexts. To tackle surface information redundancy, a Text-guided Sparse Decoder (TSD) is developed, which decouples image resolution from reasoning complexity by performing sparse sampling under text guidance. Extensive experiments on the DIOR-RSVG, OPT-RSVG, and VRSBench benchmarks demonstrate the effectiveness of CSDNet. Remarkably, CSDNet uses only 5.12% of the visual features when performing cross-modal reasoning about the target object. The code is available at https://github.com/WUTCM-Lab/CSDNet.
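To make the "sparse sampling under text guidance" idea concrete, the sketch below shows one generic way such selection can be done: score each visual token against a pooled text embedding and keep only the top-scoring fraction (here 5.12%, matching the figure quoted in the abstract) before any cross-modal decoding. This is a minimal illustration under assumed inputs, not the authors' TSD implementation; the function name `select_tokens`, the cosine-similarity scoring rule, and the tensor shapes are all assumptions for the example.

```python
# Illustrative sketch (not CSDNet's actual TSD): text-guided sparse selection
# of visual tokens, keeping roughly 5% of them before cross-modal decoding.
import torch
import torch.nn.functional as F


def select_tokens(visual_tokens: torch.Tensor,
                  text_embedding: torch.Tensor,
                  sparsity_ratio: float = 0.0512) -> torch.Tensor:
    """Keep the visual tokens most relevant to the text query.

    visual_tokens:  (B, N, D) patch/grid features from an image encoder.
    text_embedding: (B, D) pooled embedding of the referring expression.
    Returns:        (B, K, D) with K = round(N * sparsity_ratio).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(round(N * sparsity_ratio)))

    # Relevance score of every visual token w.r.t. the text query.
    scores = F.cosine_similarity(
        visual_tokens, text_embedding.unsqueeze(1), dim=-1)   # (B, N)

    # Retain only the top-k scoring tokens per image.
    topk_idx = scores.topk(k, dim=1).indices                   # (B, K)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, K, D)
    return visual_tokens.gather(1, gather_idx)


if __name__ == "__main__":
    vis = torch.randn(2, 1024, 256)   # e.g. a 32x32 feature map, 256-dim tokens
    txt = torch.randn(2, 256)
    sparse = select_tokens(vis, txt)
    print(sparse.shape)               # torch.Size([2, 52, 256]) ~ 5.12% of 1024
```

Because the number of retained tokens is fixed by the sparsity ratio rather than by the input resolution, the cost of the subsequent reasoning stage no longer grows with image size, which is the decoupling the abstract refers to.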
About the journal:
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion and for fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.