Hierarchical Fusion Transformer for Multimodal Ground-Based Cloud Type Classification
Authors: Shuang Liu; Zeyu Yu; Zhong Zhang; Chaojun Shi; Baihua Xiao
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 25192-25203
Publication date: 2025-09-26
DOI: 10.1109/JSTARS.2025.3614756
URL: https://ieeexplore.ieee.org/document/11181175/
Impact factor: 5.3 (JCR Q1, Engineering, Electrical & Electronic)
Citations: 0
Abstract
Existing methods for multimodal ground-based cloud type classification are dominated by convolutional neural networks, which fail to capture long-range dependencies. In this article, we propose a novel Transformer-based architecture, the hierarchical fusion transformer (HFT), for multimodal ground-based cloud type classification. It leverages self-attention and cross-attention to learn long-range dependencies and to fuse cloud images with meteorological element information effectively. Specifically, we propose a visual and meteorological joint-transformer (VM Joint-Trans) to capture global context across modalities, and a visual and meteorological cross-transformer (VM Cross-Trans) to align the modalities and reduce their inconsistencies. We design a hierarchical architecture that performs comprehensive fusion using VM Joint-Trans and VM Cross-Trans. We also propose a novel multimodal contrastive learning scheme that constrains not only the tokens of cloud images and meteorological element information in the same layer but also the tokens from the same modality in different layers, thereby improving the discriminative ability of the model and reducing the modality gap. Furthermore, we release a large-scale multimodal ground-based cloud database containing 10 000 multimodal samples in seven categories. To the best of the authors’ knowledge, it is the largest database for multimodal ground-based cloud type classification. Experimental results validate the effectiveness of the proposed HFT for multimodal ground-based cloud type classification.
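The abstract names cross-attention as one of the mechanisms used to fuse cloud images with meteorological information. As a rough illustration of that general idea only, the sketch below implements plain single-head scaled dot-product cross-attention, with queries taken from one modality and keys and values from the other. It is not the authors' VM Cross-Trans: the function names, the toy token values, and the omission of learned projections, multiple heads, and normalization are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    Queries come from one modality; keys and values come from the other.
    Learned projection matrices are omitted (an assumption, for brevity).
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Each output token is a weighted average of the value tokens.
        fused.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return fused


# Hypothetical toy tokens: three image tokens and two meteorological tokens.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
meteo_tokens = [[0.5, 0.5], [2.0, 0.0]]

# Image tokens attend to meteorological tokens.
fused = cross_attention(image_tokens, meteo_tokens, meteo_tokens)
```

Each row of `fused` is a convex combination of the meteorological tokens, weighted by how strongly the corresponding image token matches each of them. A full model would normally add learned query, key, and value projections and run the attention in both directions.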
About the journal:
The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues sponsored by the IEEE Geoscience and Remote Sensing Society. The journal draws upon the experience of the highly successful “IEEE Transactions on Geoscience and Remote Sensing” and provides a complementary medium for the wide range of topics in applied Earth observations. The ‘Applications’ area encompasses the societal benefit areas of the Global Earth Observation System of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these areas, including biodiversity, health, and climate, are not traditionally addressed in the IEEE context. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards, and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the journal attracts a broad range of interests, serving present members in new ways and expanding IEEE's visibility into new areas.