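RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models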

IF 10.6 | Zone 1, Earth Sciences | Q1 GEOGRAPHY, PHYSICAL

ISPRS Journal of Photogrammetry and Remote Sensing, Volume 226, Pages 146-163 | Pub Date: 2025-05-21 | DOI: 10.1016/j.isprsjprs.2025.05.002

Junyao Ge, Xu Zhang, Yang Zheng, Kaitai Guo, Jimin Liang

Citations: 0

Abstract

RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models

Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.
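
The workflow summarized above (plain OSM data in, LLM-written captions out, two captions per image) can be illustrated with a minimal sketch. This is not the authors' pipeline: the prompt wording, the `build_caption_prompt` helper, and the `CaptionedPatch` record layout are all hypothetical, and the LLM call and the image retrieval from GEE are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class CaptionedPatch:
    """One image patch paired with two captions (hypothetical record layout)."""
    image_id: str
    captions: tuple[str, str]


def build_caption_prompt(osm_tags: dict[str, str]) -> str:
    """Serialize raw OSM key/value tags into a text prompt for a captioning LLM.

    The prompt wording is an assumption made for illustration only.
    """
    # Sort keys so the same tag set always yields the same prompt.
    tag_lines = [f"- {key}: {value}" for key, value in sorted(osm_tags.items())]
    return (
        "Describe the remote sensing image patch that contains the following "
        "OpenStreetMap element in one fluent sentence.\n"
        "Tags:\n" + "\n".join(tag_lines)
    )


if __name__ == "__main__":
    # Example tags as they might appear on an OSM element.
    print(build_caption_prompt({"landuse": "farmland", "crop": "rice"}))
```

In a real setting, the returned prompt would be sent to an LLM of choice and the generated text stored in a record such as `CaptionedPatch`; how the paper prompts its models and filters the output is described in the full text, not here.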

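Source journal

ISPRS Journal of Photogrammetry and Remote Sensing (Engineering & Technology: Imaging Science & Photographic Technology)

CiteScore: 21.00

Self-citation rate: 6.30%

Articles published: 273

Review time: 40 days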

About the journal: The ISPRS Journal of Photogrammetry and Remote Sensing (P&RS) serves as the official journal of the International Society for Photogrammetry and Remote Sensing (ISPRS). It acts as a platform for scientists and professionals worldwide who are involved in various disciplines that utilize photogrammetry, remote sensing, spatial information systems, computer vision, and related fields. The journal aims to facilitate communication and dissemination of advancements in these disciplines, while also acting as a comprehensive source of reference and archive. P&RS endeavors to publish high-quality, peer-reviewed research papers that are preferably original and have not been published before. These papers can cover scientific/research, technological development, or application/practical aspects. Additionally, the journal welcomes papers that are based on presentations from ISPRS meetings, as long as they are considered significant contributions to the aforementioned fields. In particular, P&RS encourages the submission of papers that are of broad scientific interest, showcase innovative applications (especially in emerging fields), have an interdisciplinary focus, discuss topics that have received limited attention in P&RS or related journals, or explore new directions in scientific or professional realms. It is preferred that theoretical papers include practical applications, while papers focusing on systems and applications should include a theoretical background.