Hierarchical Fusion Transformer for Multimodal Ground-Based Cloud Type Classification

IF 5.3 · CAS Zone 2 (Earth Sciences) · JCR Q1 (Engineering, Electrical & Electronic)

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing · Pub Date: 2025-09-26 · DOI: 10.1109/JSTARS.2025.3614756

Shuang Liu;Zeyu Yu;Zhong Zhang;Chaojun Shi;Baihua Xiao

Volume 18, pp. 25192-25203 · Publication type: Journal Article · Open access: no
Article page: https://ieeexplore.ieee.org/document/11181175/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11181175

Citations: 0

Abstract

Existing methods for multimodal ground-based cloud type classification are dominated by convolutional neural networks, which fail to capture long-range dependencies. In this article, we propose a novel Transformer-based architecture, the hierarchical fusion transformer (HFT), for multimodal ground-based cloud type classification. It uses self-attention and cross-attention to learn long-range dependencies and to fuse cloud images with meteorological element information. Specifically, we propose the visual and meteorological joint-transformer (VM Joint-Trans) to capture global context across modalities, and the visual and meteorological cross-transformer (VM Cross-Trans) to align the modalities and reduce their inconsistencies. We design a hierarchical architecture that performs comprehensive fusion using VM Joint-Trans and VM Cross-Trans. We also propose a novel multimodal contrastive learning scheme that constrains not only the tokens of cloud images and meteorological element information in the same layer, but also the tokens of the same modality in different layers, thereby improving the discriminative ability of the model and reducing the modality gap. Furthermore, we release a large-scale multimodal ground-based cloud database containing 10 000 multimodal samples in seven categories. To the best of the authors’ knowledge, it is the largest database for multimodal ground-based cloud type classification. Experimental results validate the effectiveness of the proposed HFT for multimodal ground-based cloud type classification.
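
The abstract distinguishes two fusion styles: joint attention over both modalities at once, and cross-attention between them. The sketch below illustrates that distinction only; it is not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, heads, or normalization, and the function names (`joint_fusion`, `cross_fusion`), token counts, and toy values are all hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (an assumption).

    Returns one output vector per query token, each a weighted
    average of the value vectors.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def joint_fusion(visual, meteo):
    """Joint-style fusion: self-attention over the concatenated token sequence."""
    tokens = visual + meteo
    fused = attention(tokens, tokens, tokens)
    return fused[:len(visual)], fused[len(visual):]

def cross_fusion(visual, meteo):
    """Cross-style fusion: each modality queries the other modality's tokens."""
    visual_out = attention(visual, meteo, meteo)
    meteo_out = attention(meteo, visual, visual)
    return visual_out, meteo_out

# Toy inputs (hypothetical): 3 image-patch tokens, 2 meteorological tokens, dimension 4.
visual_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
meteo_tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.4]]

joint_v, joint_m = joint_fusion(visual_tokens, meteo_tokens)
cross_v, cross_m = cross_fusion(visual_tokens, meteo_tokens)
print(len(joint_v), len(joint_m), len(cross_v), len(cross_m))  # prints: 3 2 3 2
```

In the joint variant every token attends to all tokens of both modalities, whereas in the cross variant each output is built purely from the other modality's tokens. How the paper stacks these blocks into a hierarchy, and how its contrastive loss is defined, is not specified in the abstract and is not modeled here.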


Source journal

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Geosciences: Imaging Science & Photographic Technology)

CiteScore: 9.30

Self-citation rate: 10.90%

Articles published: 563

Review time: 4.7 months

Journal description: The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues sponsored by the IEEE Geoscience and Remote Sensing Society. The journal draws upon the experience of the highly successful “IEEE Transactions on Geoscience and Remote Sensing” and provides a complementary medium for the wide range of topics in applied Earth observations. The “Applications” areas encompass the societal benefit areas of the Global Earth Observation System of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these areas, including biodiversity, health, and climate, are not traditionally addressed in the IEEE context. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards, and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the Journal attracts a broad range of interests, serving present members in new ways and expanding IEEE visibility into new areas.