A Multimodal Unified Representation Learning Framework With Masked Image Modeling for Remote Sensing Images

IF 7.5 · Zone 1, Earth Sciences · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Geoscience and Remote Sensing Pub Date : 2024-11-18 DOI:10.1109/TGRS.2024.3494244

Dakuan Du;Tianzhu Liu;Yanfeng Gu

Volume 62, pages 1-16. Full text: https://ieeexplore.ieee.org/document/10756791/

Citations: 0

Abstract

The coordinated use of diverse types of satellite sensors provides a more comprehensive view of the Earth’s surface. However, because of the significant heterogeneity across modalities and the scarcity of high-quality labels, most existing methods underutilize massive unlabeled multimodal satellite data, which makes comprehensive scene understanding challenging. To this end, we propose a multimodal unified representation learning framework (MURLF) based on masked image modeling (MIM) for remote sensing (RS) images, aiming to make better use of massive unlabeled multimodal RS data. MURLF leverages the consistency and complementarity relationships among modalities to extract both common and distinctive features, mitigating the challenges that encoders face from significant heterogeneity across data types. In addition, MURLF applies multilevel masking independently to each modality and, as the pretext task, uses visual tokens both within the same modality and across modalities to jointly recover the masked pixels, facilitating comprehensive cross-modal information interaction. Furthermore, we design a preselected sensor-specific feature extractor (PSFE) to exploit the heterogeneous characteristics of various data sources, thereby extracting discriminative features. By integrating the multistage PSFE with the ViT backbone, MURLF can naturally extract multimodal hierarchical representations for downstream tasks, fully preserving valuable information from each modality. The proposed MURLF is not restricted to multimodal inputs; it also supports single-modal inputs during the fine-tuning stage, which significantly broadens the framework’s applicability. Extensive experiments across multiple tasks demonstrate the superiority of the proposed MURLF compared with several advanced multimodal models. The code will be released soon.
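The abstract's pretext task (mask each modality independently, then recover masked pixels from visible tokens of all modalities) can be illustrated with a minimal sketch. This is not the authors' implementation, which had not been released at the time of publication. It assumes plain random patch masking with a separate ratio per modality as a stand-in for the paper's "multilevel masking", and every name in it (`random_patch_mask`, `mask_modalities`, `masked_mse`, the `optical` and `sar` keys, the mask ratios) is hypothetical.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector of length num_patches; True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


def mask_modalities(tokens_by_modality, mask_ratios, seed=0):
    """Mask each modality independently and pool the visible tokens.

    tokens_by_modality maps a modality name to an array of shape
    (num_patches, dim). The returned joint array stacks the visible
    tokens of every modality, so an encoder fed with it could use
    both intra-modal and cross-modal context for reconstruction.
    """
    rng = np.random.default_rng(seed)
    masks = {}
    visible = []
    for name, tokens in tokens_by_modality.items():
        # Assumption: an independent random mask per modality.
        mask = random_patch_mask(tokens.shape[0], mask_ratios[name], rng)
        masks[name] = mask
        visible.append(tokens[~mask])
    joint_visible = np.concatenate(visible, axis=0)
    return joint_visible, masks


def masked_mse(prediction, target, mask):
    """Mean squared error computed on the masked patches only."""
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    data_rng = np.random.default_rng(42)
    tokens = {
        "optical": data_rng.normal(size=(196, 16)),  # hypothetical 14x14 patch grid
        "sar": data_rng.normal(size=(196, 16)),
    }
    joint, masks = mask_modalities(tokens, {"optical": 0.75, "sar": 0.5})
    print(joint.shape)  # visible tokens pooled from both modalities
    # A real model would predict the masked patches from `joint`;
    # here a zero prediction only exercises the loss function.
    print(masked_mse(np.zeros_like(tokens["sar"]), tokens["sar"], masks["sar"]))
```

In this sketch the loss is restricted to masked positions, which is the usual choice in masked image modeling; whether MURLF does the same is not stated in the abstract.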

Source journal

IEEE Transactions on Geoscience and Remote Sensing (Engineering & Technology: Geochemistry & Geophysics)

CiteScore: 11.50

Self-citation rate: 28.00%

Articles published: 1912

Review time: 4.0 months

About the journal: IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.