RSGPT: A remote sensing vision language model and benchmark

IF 10.6 | Zone 1, Earth Sciences | JCR Q1, Geography, Physical
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, Xiang Li
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 224, Pages 272-286. Journal article, published 2025-04-12.
DOI: 10.1016/j.isprsjprs.2025.03.028
Source: https://www.sciencedirect.com/science/article/pii/S0924271625001352
Citations: 0

Abstract

The emergence of Large Language Models (LLMs), with GPT-4 as a prominent example, has accelerated progress toward Artificial General Intelligence (AGI) and sparked the revolution of Artificial Intelligence 2.0. In remote sensing, there is growing interest in developing large vision language models (VLMs) tailored to data analysis in this domain. However, current research centers on visual recognition tasks, and the field lacks comprehensive, high-quality, aligned image–text datasets suitable for training large VLMs, which makes it difficult to train such models effectively for remote sensing applications. In computer vision, recent research has shown that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding, comparable to that of state-of-the-art VLMs such as GPT-4 that are trained from scratch on massive amounts of data. Inspired by this idea, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) to facilitate the development of large VLMs in the remote sensing field. Unlike previous remote sensing datasets, which use either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich, high-quality information. The dataset offers a detailed description of each image, covering the scene (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, and absolute position). To facilitate the evaluation of VLMs in remote sensing, we also provide a benchmark evaluation dataset called RSIEval. It consists of human-annotated captions and visual question–answer pairs, allowing a comprehensive assessment of VLMs in the context of remote sensing. We are actively expanding both datasets to cover a broader range of remote sensing image understanding tasks, further enhancing their utility and applicability. Our dataset and code will be released at https://github.com/Lavender105/RSGPT.
Source journal
ISPRS Journal of Photogrammetry and Remote Sensing (category: Engineering and Technology, Imaging Science and Photographic Technology)
CiteScore: 21.00
Self-citation rate: 6.30%
Articles published: 273
Review time: 40 days
Journal description: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.