{"title":"Visual Selection and Multistage Reasoning for RSVG","authors":"Yueli Ding;Haojie Xu;Di Wang;Ke Li;Yumin Tian","doi":"10.1109/LGRS.2024.3386311","DOIUrl":null,"url":null,"abstract":"Visual grounding of remote sensing (RSVG) is a task to locate targets indicated by referring expressions in remote sensing (RS) images. Previous approaches directly concatenate visual and language features and stack a series of transformer encoders for cross-modal fusion. However, this fusion strategy fails to fully leverage attributes and contextual information of the targets in referring expressions, limiting the performance of the existing methods. To address this issue, we propose a novel visual grounding framework for RSVG, named VSMR, which achieves accurate localization by adaptively selecting target-relevant features and performing multistage cross-modal reasoning. Specifically, we propose an adaptive feature selection (AFS) module, which automatically selects visual features relevant to queries while suppressing background noises. A multistage decoder (MSD) is designed to iteratively infer correlations between images and queries by leveraging abundant object attributes and contextual information in the referring expressions, thereby achieving accurate target localization. Experiments demonstrate that our method is superior to other state-of-the-art (SoTA) methods, achieving an accuracy of 78.24%.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"21 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10494585/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Remote sensing visual grounding (RSVG) is the task of locating targets indicated by referring expressions in remote sensing (RS) images. Previous approaches directly concatenate visual and language features and stack a series of transformer encoders for cross-modal fusion. However, this fusion strategy fails to fully leverage the attributes and contextual information of the targets in referring expressions, limiting the performance of existing methods. To address this issue, we propose a novel visual grounding framework for RSVG, named VSMR, which achieves accurate localization by adaptively selecting target-relevant features and performing multistage cross-modal reasoning. Specifically, we propose an adaptive feature selection (AFS) module that automatically selects visual features relevant to the query while suppressing background noise. A multistage decoder (MSD) is designed to iteratively infer correlations between images and queries by leveraging the abundant object attributes and contextual information in referring expressions, thereby achieving accurate target localization. Experiments demonstrate that our method outperforms other state-of-the-art (SoTA) methods, achieving an accuracy of 78.24%.
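The abstract does not give implementation details, but the two components it names can be illustrated concretely. Below is a minimal, hypothetical PyTorch sketch of how an AFS-style gating module and an MSD-style stacked decoder might fit together; all module names, dimensions, and design choices (sigmoid gating against a pooled query embedding, a learnable target token refined across decoder stages, a 4-d box head) are assumptions made for illustration, not the authors' published implementation.

```python
# Hypothetical sketch (not the authors' code): an AFS-style gate that
# suppresses query-irrelevant visual tokens, and an MSD-style decoder
# that refines a target embedding over several reasoning stages.
import torch
import torch.nn as nn


class AdaptiveFeatureSelection(nn.Module):
    """Assumed AFS behavior: score each visual token against the pooled
    query embedding and softly gate out background tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D) image tokens; query: (B, Nq, D) text tokens
        q = query.mean(dim=1, keepdim=True)                       # (B, 1, D) pooled query
        gate = torch.sigmoid(self.score(self.proj(visual) * q))   # (B, Nv, 1) relevance
        return visual * gate                                      # background suppressed


class MultistageDecoder(nn.Module):
    """Assumed MSD behavior: a learnable target token cross-attends to the
    fused visual/text memory once per stage, then regresses a box."""

    def __init__(self, dim: int, stages: int = 3, heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True)
            for _ in range(stages)
        )
        self.target = nn.Parameter(torch.zeros(1, 1, dim))
        self.box_head = nn.Linear(dim, 4)                         # (cx, cy, w, h), normalized

    def forward(self, visual: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        memory = torch.cat([visual, query], dim=1)                # joint cross-modal memory
        tgt = self.target.expand(visual.size(0), -1, -1)
        for layer in self.layers:                                 # one reasoning stage each
            tgt = layer(tgt, memory)
        return torch.sigmoid(self.box_head(tgt)).squeeze(1)      # (B, 4)


# Toy usage with random stand-ins for backbone/BERT features:
vis = torch.randn(2, 400, 256)   # e.g. a 20x20 visual feature map, flattened
txt = torch.randn(2, 20, 256)    # e.g. token embeddings of the referring expression
boxes = MultistageDecoder(256)(AdaptiveFeatureSelection(256)(vis, txt), txt)
print(boxes.shape)               # torch.Size([2, 4])
```

The sketch reflects the abstract's two claims: selection happens before fusion (the gate zeroes out background tokens so the decoder never attends to them), and reasoning is iterative (each decoder stage re-reads the cross-modal memory with the progressively refined target token).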