{"title":"Enhanced RSVQA Insight Through Synergistic Visual-Linguistic Attention Models","authors":"Anirban Saha;Suman Kumar Maji","doi":"10.1109/LGRS.2025.3592253","DOIUrl":null,"url":null,"abstract":"The interpretation of remote sensing images remains a significant challenge due to their complex, information-rich nature. Current remote sensing visual question answering (RSVQA) techniques have been a step forward toward building intelligent analysis systems for remote sensing images. However, most existing RSVQA models that rely on ResNet, VGG, and Swin transformers as visual feature extractors often fail to capture complex visual relationships, particularly the intricate dependencies between segmented regions and depth-related features in remote sensing data. To address these limitations, this letter introduces a novel RSVQA approach that leverages state-of-the-art components with an innovative architecture to advance interactive remote sensing analysis. The proposed model features a novel dual-layer visual attention mechanism in the representation module to process intricate features and capture regional relationships alongside processing the overall features. The fusion module employs a unique attention-based design, combining both self-attention and mutual attention, to integrate these features into a unified vector representation. Finally, the answering module utilizes a refined multilayer perceptron classifier for precise response generation. Evaluations on an RSVQA benchmark demonstrate the system’s superiority over existing methods, marking a significant step forward in remote sensing analytics.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11095729/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The interpretation of remote sensing images remains a significant challenge due to their complex, information-rich nature. Current remote sensing visual question answering (RSVQA) techniques are a step toward building intelligent analysis systems for remote sensing imagery. However, most existing RSVQA models, which rely on ResNet, VGG, or Swin Transformer backbones as visual feature extractors, often fail to capture complex visual relationships, particularly the intricate dependencies between segmented regions and depth-related features in remote sensing data. To address these limitations, this letter introduces a novel RSVQA approach that combines state-of-the-art components in an innovative architecture to advance interactive remote sensing analysis. The proposed model features a novel dual-layer visual attention mechanism in the representation module that processes intricate features and captures regional relationships while also handling the overall features. The fusion module employs an attention-based design, combining self-attention and mutual attention, to integrate these features into a unified vector representation. Finally, the answering module uses a refined multilayer perceptron (MLP) classifier for precise response generation. Evaluations on an RSVQA benchmark demonstrate the system's superiority over existing methods, marking a significant step forward in remote sensing analytics.
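To make the described pipeline concrete, below is a minimal PyTorch sketch of the three-module structure the abstract outlines (dual-layer visual attention, self- plus mutual-attention fusion, and an MLP answering head). The paper's actual backbones, layer sizes, and wiring are not given here, so every class name, dimension, and design choice below is a hypothetical placeholder under stated assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the abstract's three modules; all names and
# hyperparameters are illustrative assumptions, not the published model.
import torch
import torch.nn as nn


class DualLayerVisualAttention(nn.Module):
    """Stand-in for the dual-layer visual attention: one self-attention
    pass over region tokens (regional relationships), then a second pass
    with a prepended learnable global token (overall features)."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.regional = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.overall = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, dim) features from a visual backbone.
        r, _ = self.regional(regions, regions, regions)  # region-region deps
        g = self.global_token.expand(r.size(0), -1, -1)
        x = torch.cat([g, r], dim=1)
        x, _ = self.overall(x, x, x)                     # global-context pass
        return x


class AttentionFusion(nn.Module):
    """Stand-in for the fusion module: self-attention within the question
    tokens, then mutual (cross-) attention from text to visual tokens,
    pooled into one unified vector."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        t, _ = self.self_attn(text, text, text)          # intra-modal attention
        f, _ = self.cross_attn(t, visual, visual)        # text attends to image
        return f.mean(dim=1)                             # unified vector


class RSVQASketch(nn.Module):
    """End-to-end toy pipeline: visual attention -> fusion -> MLP classifier
    over a fixed answer vocabulary (answer-as-classification, common in RSVQA)."""

    def __init__(self, dim: int = 512, num_answers: int = 100):
        super().__init__()
        self.visual_attn = DualLayerVisualAttention(dim)
        self.fusion = AttentionFusion(dim)
        self.answer_head = nn.Sequential(                # MLP answering module
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        v = self.visual_attn(regions)
        fused = self.fusion(v, question)
        return self.answer_head(fused)


# Toy usage: random tensors stand in for backbone / question-encoder outputs.
model = RSVQASketch()
regions = torch.randn(2, 49, 512)   # e.g., 7x7 feature-map tokens per image
question = torch.randn(2, 12, 512)  # encoded question tokens
logits = model(regions, question)   # (2, num_answers) answer scores
```

Treating answer generation as classification over a fixed vocabulary, as the final `Linear` layer does here, is the standard setup on RSVQA benchmarks; the letter's "refined" MLP classifier may differ in depth, normalization, or training details not recoverable from the abstract.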