Visual Selection and Multistage Reasoning for RSVG

Yueli Ding;Haojie Xu;Di Wang;Ke Li;Yumin Tian
DOI: 10.1109/LGRS.2024.3386311
Journal: IEEE Geoscience and Remote Sensing Letters, vol. 21, pp. 1-5
Published: 2024-04-08 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10494585/
Citations: 0

Abstract

Visual grounding of remote sensing (RSVG) is a task to locate targets indicated by referring expressions in remote sensing (RS) images. Previous approaches directly concatenate visual and language features and stack a series of transformer encoders for cross-modal fusion. However, this fusion strategy fails to fully leverage attributes and contextual information of the targets in referring expressions, limiting the performance of the existing methods. To address this issue, we propose a novel visual grounding framework for RSVG, named VSMR, which achieves accurate localization by adaptively selecting target-relevant features and performing multistage cross-modal reasoning. Specifically, we propose an adaptive feature selection (AFS) module, which automatically selects visual features relevant to queries while suppressing background noises. A multistage decoder (MSD) is designed to iteratively infer correlations between images and queries by leveraging abundant object attributes and contextual information in the referring expressions, thereby achieving accurate target localization. Experiments demonstrate that our method is superior to other state-of-the-art (SoTA) methods, achieving an accuracy of 78.24%.
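The abstract describes two components: an adaptive feature selection (AFS) module that gates visual features by their relevance to the query, and a multistage decoder (MSD) that iteratively refines cross-modal correlations. The paper's code is not reproduced here; the following is a minimal NumPy sketch of the general pattern (query-conditioned feature gating followed by iterative attention-based refinement). All function names, dimensions, and the residual-update rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_feature_selection(visual, query):
    """Sketch of AFS-style gating: weight visual tokens by query relevance.

    visual: (N, D) visual token features; query: (D,) text embedding.
    Tokens with low query relevance (e.g., background) get small weights.
    """
    scores = visual @ query / np.sqrt(visual.shape[1])   # (N,) relevance logits
    weights = softmax(scores)                            # (N,) relevance weights
    return visual * weights[:, None], weights            # gated features

def multistage_reasoning(visual, query, stages=3):
    """Sketch of MSD-style iteration: repeatedly attend over visual tokens
    and fold the attended context back into the query representation."""
    q = query
    for _ in range(stages):
        attn = softmax(visual @ q / np.sqrt(visual.shape[1]))  # (N,) attention
        context = attn @ visual                                # (D,) attended visual
        q = q + context                                        # residual refinement
    return q

# Toy usage with random features (16 visual tokens, 32-dim embeddings).
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 32))
query = rng.normal(size=32)
gated, w = adaptive_feature_selection(visual, query)
refined = multistage_reasoning(gated, query)
```

In a real grounding head, the refined query vector would feed a regression layer that predicts the target bounding box; here it only illustrates the select-then-reason control flow.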