Context-driven and sparse decoding for Remote Sensing Visual Grounding
Yichen Zhao, Yaxiong Chen, Ruilin Yao, Shengwu Xiong, Xiaoqiang Lu
Information Fusion, Volume 123, Article 103296 (published 2025-05-22). DOI: 10.1016/j.inffus.2025.103296
Citations: 0
Abstract
Remote Sensing Visual Grounding (RSVG) is an emerging multimodal remote sensing task that grounds textual descriptions to specific objects in remote sensing images. Previous methods often overlook the impact of complex backgrounds and similar geographic entities during feature extraction, which can confuse target features and create performance bottlenecks. Moreover, remote sensing scenes contain extensive surface information, much of which contributes little to reasoning about the target object. This redundancy not only increases the computational burden but also impairs decoding efficiency. To this end, we propose the Context-driven Sparse Decoding Network (CSDNet) for accurate grounding through multimodal context-aware feature extraction and text-guided sparse reasoning. To alleviate target feature confusion, a Text-aware Fusion Module (TFM) is introduced to refine the visual features using textual cues related to the image context. In addition, a Context-enhanced Interaction Module (CIM) is proposed to harmonize the differences between remote sensing images and text by modeling multimodal contexts. To tackle surface information redundancy, a Text-guided Sparse Decoder (TSD) is developed, which decouples image resolution from reasoning complexity by performing sparse sampling under text guidance. Extensive experiments on the DIOR-RSVG, OPT-RSVG, and VRSBench benchmarks demonstrate the effectiveness of CSDNet. Remarkably, CSDNet uses only 5.12% of the visual features when performing cross-modal reasoning about the target object. The code is available at https://github.com/WUTCM-Lab/CSDNet.
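To make the "sparse sampling under text guidance" idea concrete, the sketch below shows one generic way such selection can be done: score each visual token against a pooled text embedding and keep only the top-scoring fraction (here 5.12%, matching the figure quoted in the abstract) before any cross-modal decoding. This is a minimal illustration under assumed inputs, not the authors' TSD implementation; the function name `select_tokens`, the cosine-similarity scoring rule, and the tensor shapes are all assumptions for the example.

```python
# Illustrative sketch (not CSDNet's actual TSD): text-guided sparse selection
# of visual tokens, keeping roughly 5% of them before cross-modal decoding.
import torch
import torch.nn.functional as F


def select_tokens(visual_tokens: torch.Tensor,
                  text_embedding: torch.Tensor,
                  sparsity_ratio: float = 0.0512) -> torch.Tensor:
    """Keep the visual tokens most relevant to the text query.

    visual_tokens:  (B, N, D) patch/grid features from an image encoder.
    text_embedding: (B, D) pooled embedding of the referring expression.
    Returns:        (B, K, D) with K = round(N * sparsity_ratio).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(round(N * sparsity_ratio)))

    # Relevance score of every visual token w.r.t. the text query.
    scores = F.cosine_similarity(
        visual_tokens, text_embedding.unsqueeze(1), dim=-1)   # (B, N)

    # Retain only the top-k scoring tokens per image.
    topk_idx = scores.topk(k, dim=1).indices                   # (B, K)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, K, D)
    return visual_tokens.gather(1, gather_idx)


if __name__ == "__main__":
    vis = torch.randn(2, 1024, 256)   # e.g. a 32x32 feature map, 256-dim tokens
    txt = torch.randn(2, 256)
    sparse = select_tokens(vis, txt)
    print(sparse.shape)               # torch.Size([2, 52, 256]) ~ 5.12% of 1024
```

Because the number of retained tokens is fixed by the sparsity ratio rather than by the input resolution, the cost of the subsequent reasoning stage no longer grows with image size, which is the decoupling the abstract refers to.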
About the journal:
Information Fusion serves as a central platform for showcasing advances in multi-sensor, multi-source, multi-process information fusion and for fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.