RSGPT: A remote sensing vision language model and benchmark
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, Xiang Li
DOI: 10.1016/j.isprsjprs.2025.03.028
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 224, Pages 272-286 (published 2025-04-12)
Article URL: https://www.sciencedirect.com/science/article/pii/S0924271625001352
Citations: 0
Abstract
The emergence of large-scale Large Language Models (LLMs), with GPT-4 as a prominent example, has significantly propelled the rapid advancement of Artificial General Intelligence (AGI) and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing, there is growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks and lacks comprehensive, high-quality, aligned image–text datasets suitable for training large VLMs, which poses significant challenges to effectively training such models for remote sensing applications. In computer vision, recent research has demonstrated that fine-tuning large VLMs on small-scale, high-quality datasets can yield impressive performance in visual and language understanding, comparable to that of state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this idea, in this work we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the remote sensing field. Unlike previous remote sensing datasets that employ either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich, high-quality information. The dataset offers a detailed description for each image, encompassing the scene (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position). To facilitate the evaluation of VLMs in remote sensing, we also provide a benchmark evaluation dataset called RSIEval, which consists of human-annotated captions and visual question–answer pairs, allowing for a comprehensive assessment of VLMs in the remote sensing context.
We are actively expanding both datasets to cover a broader spectrum of remote sensing image understanding tasks, further enhancing their utility and applicability. Our dataset and code will be released at https://github.com/Lavender105/RSGPT.
About the journal:
The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) is the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It serves as a platform for scientists and professionals worldwide working across the disciplines that use photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal facilitates communication and dissemination of advancements in these disciplines, while also serving as a comprehensive source of reference and as an archive.
P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published previously. These papers can cover scientific/research, technological-development, or application/practical aspects. Additionally, the journal welcomes papers based on presentations from ISPRS meetings, provided they constitute significant contributions to the aforementioned fields.
In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.