Ze Zhang, Enyuan Zhao, Ziyi Wan, Xinyue Liang, Min Ye, Jie Nie, Lei Huang
Expert Systems with Applications, Volume 291, Article 128395. Published 2025-06-13. Impact factor 7.5 (JCR Q1, Computer Science, Artificial Intelligence).
DOI: 10.1016/j.eswa.2025.128395
https://www.sciencedirect.com/science/article/pii/S0957417425020147
Frequency domain transfer learning for remote sensing visual question answering
Remote Sensing Visual Question Answering (RSVQA) aims to parse the content of remote sensing images through multimodal interaction in order to extract scientific knowledge accurately. Current RSVQA methods typically fine-tune pre-trained models on specific datasets, an approach that overlooks the complex structured information in remote sensing images, which is tightly coupled with color, scale, and semantics. These methods also do not adequately handle the overfitting caused by the high complexity and noise of remote sensing data, so their predictions are neither comprehensive nor accurate. To address these issues, this paper proposes a Parameter-Efficient Transfer Learning (PETL) method based on the frequency domain. By leveraging the Fourier transform, it captures the intricate structural information of complex remote sensing images and improves generalizability across domains, data, and models. The main contributions of this paper are as follows: 1) We introduce an efficient and stable X-PFA framework for RSVQA. Here, ‘X’ denotes a pre-trained VLP model, and ‘PFA’ stands for Primary Frequency Adapter, which applies a Fast Fourier Transform (FFT) to the intermediate spatial-domain features at each layer to produce the corresponding frequency representation. This representation consists of an amplitude component (encoding scene-perceptual style such as texture, color, and scene contrast) and a phase component (encoding rich semantics). The PFA adapts to the distribution of a specific dataset by learning salient features in the frequency domain. 2) The proposed framework performs stably and well across various pre-trained models and significantly mitigates overfitting on small datasets. On average, accuracy improves by 0.62% and stability increases by 28.5%. 3) The PFA contains only 17.2 million trainable parameters. Compared with full-parameter fine-tuning, our approach reduces training time by 54.9%, a substantial saving in training cost.
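The amplitude/phase split mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' PFA implementation: the function names (`dft2`, `idft2`, `amplitude_phase`, `recombine`) and the toy 3 x 4 feature grid are assumptions made for illustration, a naive discrete Fourier transform stands in for an FFT routine (same result, slower), and the learned adaptation step is omitted because the abstract does not specify it.

```python
import cmath

def dft2(grid):
    """Naive 2-D discrete Fourier transform of an H x W grid of real numbers."""
    h, w = len(grid), len(grid[0])
    return [[sum(grid[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]

def idft2(spec):
    """Inverse of dft2: maps an H x W complex spectrum back to the spatial domain."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * cmath.pi * (u * y / h + v * x / w))
                 for u in range(h) for v in range(w)) / (h * w)
             for x in range(w)]
            for y in range(h)]

def amplitude_phase(grid):
    """Split spatial features into amplitude and phase spectra."""
    spec = dft2(grid)
    amp = [[abs(c) for c in row] for row in spec]
    pha = [[cmath.phase(c) for c in row] for row in spec]
    return amp, pha

def recombine(amp, pha):
    """Rebuild spatial features from (possibly modified) amplitude and phase."""
    spec = [[cmath.rect(a, p) for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(amp, pha)]
    return [[c.real for c in row] for row in idft2(spec)]

# Hypothetical single-channel feature map standing in for one layer's features.
features = [[1.0, 2.0, 0.0, -1.0],
            [0.5, 0.0, 3.0, 1.0],
            [2.0, -2.0, 1.0, 0.0]]

amp, pha = amplitude_phase(features)
restored = recombine(amp, pha)  # unmodified spectra reproduce the input
```

Because the decomposition is invertible, an adapter can in principle adjust the amplitude and phase separately and then map the result back to the spatial domain; how the PFA does this is described in the paper itself, not here.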
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.