Title: Integrating remote sensing with OpenStreetMap data for comprehensive scene understanding through multi-modal self-supervised learning
Authors: Lubin Bai, Xiuyuan Zhang, Haoyu Wang, Shihong Du
Journal: Remote Sensing of Environment (Impact Factor 11.1; JCR Q1, Environmental Sciences)
Published: 2024-12-23
DOI: 10.1016/j.rse.2024.114573 (https://doi.org/10.1016/j.rse.2024.114573)
Citations: 0
Abstract
OpenStreetMap (OSM) contains valuable geographic knowledge for remote sensing (RS) interpretation. RS images and OSM data provide correlated and complementary descriptions of a given region, so integrating them can lead to a more comprehensive understanding of a geographic scene. However, because the two data sources differ significantly, little progress has been made in fusing RS and OSM data, and how to extract information from multiple geographic data sources and make it interact and collaborate remains largely unexplored. In this work, we design a multi-modal self-supervised learning (SSL) approach to fuse RS images and OSM data, which extracts meaningful features from the two complementary data sources in an unsupervised manner, resulting in comprehensive scene understanding. We unify information extraction, interaction, and collaboration for RS and OSM data in a single SSL framework, named Rose. For information extraction, we start from the complementarity between the two modalities and design an OSM encoder that aligns with the ViT image encoder. For information interaction, we leverage the spatial correlation between RS and OSM data to guide the cross-attention module, thereby enhancing information transfer. For information collaboration, we design a joint mask-reconstruction learning strategy that achieves cooperation between the two modalities by reconstructing the original inputs with reference to information from both sources. The three parts are interlinked and blend seamlessly into a unified framework. Finally, Rose generates three kinds of representations, i.e., RS features, OSM features, and RS-OSM fusion features, which can be used for multiple downstream tasks. Extensive experiments on land use semantic segmentation, population estimation, and carbon emission estimation demonstrate Rose's multitasking capability, label efficiency, and robustness to noise. Rose can associate RS images and OSM data at a fine level of granularity, enhancing its effectiveness on fine-grained tasks like land use semantic segmentation. The code can be found at https://github.com/bailubin/Rose.
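The abstract's idea of letting spatial correlation guide cross-attention can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical precomputed boolean mask `overlap[i][j]` that is true when OSM object j intersects image patch i, and it uses a large negative bias to suppress attention between non-overlapping pairs. The function name, the toy vectors, and the `penalty` parameter are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_cross_attention(patch_queries, osm_keys, osm_values, overlap,
                            penalty=-1e9):
    """Single-head cross-attention from RS patch tokens to OSM object tokens.

    overlap[i][j] (hypothetical input) is True when OSM object j spatially
    intersects patch i. Non-overlapping pairs get a large negative bias, so
    their attention weight is effectively zero. If a patch overlaps no object,
    every score receives the same bias and the weights fall back to uniform.
    """
    dim = len(patch_queries[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    out_dim = len(osm_values[0])
    outputs, weights = [], []
    for i, q in enumerate(patch_queries):
        scores = []
        for j, k in enumerate(osm_keys):
            dot = sum(a * b for a, b in zip(q, k)) * scale
            scores.append(dot + (0.0 if overlap[i][j] else penalty))
        w = softmax(scores)
        # Weighted sum of OSM value vectors for this patch.
        fused = [sum(w[j] * osm_values[j][d] for j in range(len(osm_values)))
                 for d in range(out_dim)]
        outputs.append(fused)
        weights.append(w)
    return outputs, weights


# Toy example: 2 image patches, 3 OSM objects, 2-dimensional features.
patch_queries = [[1.0, 0.0], [0.0, 1.0]]
osm_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
osm_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
overlap = [[True, False, True],    # patch 0 intersects objects 0 and 2
           [False, True, False]]   # patch 1 intersects object 1 only

fused, attn = spatial_cross_attention(patch_queries, osm_keys, osm_values,
                                      overlap)
print(attn)
print(fused)
```

In this toy setup each patch draws information only from the OSM objects that fall inside it, which is one simple way to read "spatial correlation guides the cross-attention module". A softer variant could replace the hard mask with a distance-dependent bias.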
About the journal:
Remote Sensing of Environment (RSE) serves the Earth observation community by disseminating results on the theory, science, applications, and technology that contribute to advancing the field of remote sensing. With a thoroughly interdisciplinary approach, RSE encompasses terrestrial, oceanic, and atmospheric sensing.
The journal emphasizes biophysical and quantitative approaches to remote sensing at local to global scales, covering a diverse range of applications and techniques.
RSE serves as a vital platform for the exchange of knowledge and advancements in the dynamic field of remote sensing.