Frequency domain transfer learning for remote sensing visual question answering

IF 7.5 · Zone 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications Pub Date : 2025-06-13 DOI:10.1016/j.eswa.2025.128395

Ze Zhang, Enyuan Zhao, Ziyi Wan, Xinyue Liang, Min Ye, Jie Nie, Lei Huang

Volume 291, Article 128395

https://www.sciencedirect.com/science/article/pii/S0957417425020147

Citations: 0


Frequency domain transfer learning for remote sensing visual question answering

Remote Sensing Visual Question Answering (RSVQA) aims to parse the content of remote sensing images through multimodal interaction in order to accurately extract scientific knowledge. Current RSVQA methods typically fine-tune pre-trained models on specific datasets, which overlooks the mining of complex structured information in remote sensing images that is highly coupled with color, scale, and semantics. These methods also deal inadequately with the overfitting caused by the high complexity and noise of remote sensing data, resulting in predictions that are neither comprehensive nor accurate. To mitigate these issues, this paper proposes a Parameter-Efficient Transfer Learning (PETL) method based on the frequency domain. By leveraging the Fourier transform, it captures the intricate structural information of complex remote sensing images and enhances generalizability across domains, data, and models. The main contributions of this paper are as follows: 1) We introduce an efficient and stable X-PFA framework for RSVQA. Here, ‘X’ denotes a pre-trained VLP model, and ‘PFA’ stands for Primary Frequency Adapter, which applies a Fast Fourier Transform (FFT) to the intermediate spatial-domain features at each layer to produce the corresponding frequency representation, consisting of an amplitude component (encoding scene-perceptual style such as texture, color, and scene contrast) and a phase component (encoding rich semantics). The PFA adapts to the specific dataset distribution by learning salient features in the frequency domain. 2) The proposed framework performs stably and well across various pre-trained models, significantly mitigating overfitting on small datasets. On average, accuracy improves by 0.62% and stability increases by 28.5%. 3) The PFA contains only 17.2 million trainable parameters. Compared to full-parameter fine-tuning, our approach reduces training time by 54.9%, resulting in substantial training cost savings.
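The abstract does not spell out the internals of the PFA, so the following is only a minimal sketch of the amplitude/phase decomposition it describes, written with NumPy on a toy feature map. The function names, the `(C, H, W)` layout, and the per-channel `amp_scale` parameter are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def to_amplitude_phase(features):
    """Split a (C, H, W) feature map into amplitude and phase with a 2-D FFT."""
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    return np.abs(spectrum), np.angle(spectrum)

def from_amplitude_phase(amplitude, phase):
    """Recombine amplitude and phase and return to the spatial domain."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real

def frequency_adapter(features, amp_scale):
    """Hypothetical adapter (an assumption, not the paper's PFA):
    rescale the amplitude per channel, keep the phase unchanged,
    and add the result back to the input as a residual."""
    amplitude, phase = to_amplitude_phase(features)
    adapted = from_amplitude_phase(amplitude * amp_scale[:, None, None], phase)
    return features + adapted

# Toy input standing in for one layer's intermediate features.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))

amplitude, phase = to_amplitude_phase(features)
restored = from_amplitude_phase(amplitude, phase)
```

In this sketch, initialising `amp_scale` to zeros makes the adapter an identity mapping, a common choice for adapter modules so that training starts from the behaviour of the frozen pre-trained model. Whether the PFA is initialised this way is not stated in the abstract.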


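Source journal

Expert Systems with Applications (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months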

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.