Frequency domain transfer learning for remote sensing visual question answering

IF 7.5 · CAS Tier 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Ze Zhang , Enyuan Zhao , Ziyi Wan , Xinyue Liang , Min Ye , Jie Nie , Lei Huang
{"title":"遥感视觉问答的频域迁移学习","authors":"Ze Zhang ,&nbsp;Enyuan Zhao ,&nbsp;Ziyi Wan ,&nbsp;Xinyue Liang ,&nbsp;Min Ye ,&nbsp;Jie Nie ,&nbsp;Lei Huang","doi":"10.1016/j.eswa.2025.128395","DOIUrl":null,"url":null,"abstract":"<div><div>Remote Sensing Visual Question Answering (RSVQA) aims to parse the content of remote sensing images through multimodal interaction to accurately extract scientific knowledge. Current RSVQA methods typically fine-tune pre-trained models on specific datasets, which overlooks the mining of complex structured information in remote sensing images that are highly coupled with color, scale, and semantics. Additionally, these methods lack adequate handling of overfitting issues caused by the high complexity and noise attributes of remote sensing data, resulting in predictions that are neither comprehensive nor accurate. To mitigate this issue, this paper proposes a Parameter-Efficient Transfer Learning (PETL) method based on the frequency domain. By leveraging Fourier Transform, it captures the intricate structural information of complex remote sensing and enhances the generalizability across domains, data, and models. The main contributions of this paper are as follows: 1) We introduce an efficient and stable X-PFA framework for Remote Sensing Visual Question Answering (RSVQA). Here, ‘X’ denotes pretrained VLP models, and ‘PFA’ stands for Primary Frequency Adapter, which performs a Fast Fourier Transform (FFT) over the intermediate spatial domain features at each layer to produce the corresponding frequency representation, including an amplitude component (encoding scene-perceptual style such as texture, color, scene contrast) and a phase component (encoding rich semantics). The PFA adapts to the specific dataset distribution by learning salient features in the frequency domain. 2) Our proposed framework demonstrates stable and excellent performance across various pre-trained models, significantly mitigating overfitting issues on small datasets. On average, accuracy improves by 0.62 %, and stability increases by 28.5 %. 3) The PFA contains only 17.2 million trainable parameters. Compared to full-parameter fine-tuning, our approach reduces training time by 54.9 %, resulting in substantial training cost savings.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"291 ","pages":"Article 128395"},"PeriodicalIF":7.5000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Frequency domain transfer learning for remote sensing visual question answering\",\"authors\":\"Ze Zhang ,&nbsp;Enyuan Zhao ,&nbsp;Ziyi Wan ,&nbsp;Xinyue Liang ,&nbsp;Min Ye ,&nbsp;Jie Nie ,&nbsp;Lei Huang\",\"doi\":\"10.1016/j.eswa.2025.128395\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Remote Sensing Visual Question Answering (RSVQA) aims to parse the content of remote sensing images through multimodal interaction to accurately extract scientific knowledge. Current RSVQA methods typically fine-tune pre-trained models on specific datasets, which overlooks the mining of complex structured information in remote sensing images that are highly coupled with color, scale, and semantics. Additionally, these methods lack adequate handling of overfitting issues caused by the high complexity and noise attributes of remote sensing data, resulting in predictions that are neither comprehensive nor accurate. 
To mitigate this issue, this paper proposes a Parameter-Efficient Transfer Learning (PETL) method based on the frequency domain. By leveraging Fourier Transform, it captures the intricate structural information of complex remote sensing and enhances the generalizability across domains, data, and models. The main contributions of this paper are as follows: 1) We introduce an efficient and stable X-PFA framework for Remote Sensing Visual Question Answering (RSVQA). Here, ‘X’ denotes pretrained VLP models, and ‘PFA’ stands for Primary Frequency Adapter, which performs a Fast Fourier Transform (FFT) over the intermediate spatial domain features at each layer to produce the corresponding frequency representation, including an amplitude component (encoding scene-perceptual style such as texture, color, scene contrast) and a phase component (encoding rich semantics). The PFA adapts to the specific dataset distribution by learning salient features in the frequency domain. 2) Our proposed framework demonstrates stable and excellent performance across various pre-trained models, significantly mitigating overfitting issues on small datasets. On average, accuracy improves by 0.62 %, and stability increases by 28.5 %. 3) The PFA contains only 17.2 million trainable parameters. Compared to full-parameter fine-tuning, our approach reduces training time by 54.9 %, resulting in substantial training cost savings.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"291 \",\"pages\":\"Article 128395\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-06-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417425020147\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425020147","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Remote Sensing Visual Question Answering (RSVQA) aims to parse the content of remote sensing images through multimodal interaction to accurately extract scientific knowledge. Current RSVQA methods typically fine-tune pre-trained models on specific datasets, which overlooks the mining of complex structured information in remote sensing images that is highly coupled with color, scale, and semantics. Additionally, these methods lack adequate handling of the overfitting caused by the high complexity and noise of remote sensing data, resulting in predictions that are neither comprehensive nor accurate. To mitigate these issues, this paper proposes a Parameter-Efficient Transfer Learning (PETL) method based on the frequency domain. By leveraging the Fourier Transform, it captures the intricate structural information of complex remote sensing images and enhances generalizability across domains, data, and models. The main contributions of this paper are as follows: 1) We introduce an efficient and stable X-PFA framework for RSVQA. Here, ‘X’ denotes a pre-trained VLP model, and ‘PFA’ stands for Primary Frequency Adapter, which performs a Fast Fourier Transform (FFT) over the intermediate spatial-domain features at each layer to produce the corresponding frequency representation, comprising an amplitude component (encoding scene-perceptual style such as texture, color, and scene contrast) and a phase component (encoding rich semantics). The PFA adapts to the specific dataset distribution by learning salient features in the frequency domain. 2) Our proposed framework demonstrates stable and excellent performance across various pre-trained models, significantly mitigating overfitting on small datasets. On average, accuracy improves by 0.62% and stability increases by 28.5%. 3) The PFA contains only 17.2 million trainable parameters. Compared to full-parameter fine-tuning, our approach reduces training time by 54.9%, resulting in substantial training cost savings.
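
The abstract describes the Primary Frequency Adapter (PFA) as applying a Fast Fourier Transform to each layer's intermediate spatial-domain features and learning salient features from the resulting amplitude (style) and phase (semantics) components. The PyTorch sketch below is only a minimal illustration of that idea under assumed shapes and parameterization; the class name, the per-channel `amp_scale`/`phase_shift` parameters, and the residual wiring are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class PrimaryFrequencyAdapterSketch(nn.Module):
    """Illustrative frequency-domain adapter (not the paper's implementation).

    Takes the FFT of a layer's intermediate features, separately modulates the
    amplitude (style) and phase (semantics) spectra with a small number of
    learnable parameters, and maps the result back to the spatial domain.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical lightweight parameters; the actual PFA may differ.
        self.amp_scale = nn.Parameter(torch.ones(dim))     # scales amplitude per channel
        self.phase_shift = nn.Parameter(torch.zeros(dim))  # shifts phase per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from a frozen pre-trained VLP layer.
        freq = torch.fft.fft(x, dim=1)            # spectrum along the token axis
        amplitude = freq.abs()                    # scene-perceptual style component
        phase = freq.angle()                      # semantic component
        amplitude = amplitude * self.amp_scale    # learn salient frequency magnitudes
        phase = phase + self.phase_shift          # learn phase adjustments
        freq_mod = torch.polar(amplitude, phase)  # recombine into a complex spectrum
        adapted = torch.fft.ifft(freq_mod, dim=1).real
        return x + adapted                        # residual connection keeps backbone features


# Example usage with ViT-like patch tokens (shapes are assumptions).
features = torch.randn(2, 196, 768)
adapter = PrimaryFrequencyAdapterSketch(dim=768)
out = adapter(features)   # same shape as the input, adapted in the frequency domain
```

In this sketch, only the frequency-domain modulation parameters are trainable while the backbone features pass through unchanged, which mirrors the parameter-efficient, adapter-style fine-tuning the abstract describes at a high level.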
Source journal
Expert Systems with Applications (Engineering Technology — Electrical & Electronic Engineering)
CiteScore: 13.80
Self-citation rate: 10.60%
Articles per year: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.