Hierarchical Attention Learning for Multimodal Classification
Xin Zou, Chang Tang, Wei Zhang, Kun Sun, Liangxiao Jiang
2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00165
Abstract
Multimodal learning aims to integrate complementary information from different modalities for more reliable decisions. However, existing multimodal classification methods simply integrate the learned local features, ignoring the underlying structure of each modality and the higher-order correlations across modalities. In this paper, we propose a novel Hierarchical Attention Learning Network (HALNet) for multimodal classification. Specifically, HALNet has three merits: 1) A hierarchical feature fusion module is proposed to learn multi-level features and aggregate them into a global feature representation using an attention mechanism and a progressive fusion strategy. 2) A cross-modal higher-order fusion module is introduced to capture potential cross-modal correlations in the label space. 3) A dual prediction pattern is designed to generate credible decisions. Extensive experiments on three real-world multimodal datasets demonstrate that HALNet achieves competitive performance compared to the state-of-the-art.
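To make the attention-based hierarchical aggregation described above concrete, the following is a minimal PyTorch sketch of the general idea: per-modality attention over multi-level features followed by a simple late fusion, with modality-specific and fused heads as a rough stand-in for the dual prediction pattern. All module and variable names (LevelAttention, HALNetSketch, etc.) are illustrative assumptions and do not correspond to the authors' released implementation.

```python
# Hypothetical sketch of attention-weighted multi-level feature aggregation and
# late fusion; not the authors' code, just an illustration of the general idea.
import torch
import torch.nn as nn


class LevelAttention(nn.Module):
    """Scores each level's feature and aggregates them into one global vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1)
        )

    def forward(self, level_feats: torch.Tensor) -> torch.Tensor:
        # level_feats: (batch, num_levels, dim)
        weights = torch.softmax(self.score(level_feats), dim=1)  # (batch, num_levels, 1)
        return (weights * level_feats).sum(dim=1)                # (batch, dim)


class HALNetSketch(nn.Module):
    """Two-modality sketch: per-modality hierarchical aggregation, then late fusion,
    with per-modality heads as a rough analogue of the dual prediction pattern."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.attn_a = LevelAttention(dim)
        self.attn_b = LevelAttention(dim)
        self.head_a = nn.Linear(dim, num_classes)
        self.head_b = nn.Linear(dim, num_classes)
        self.head_fused = nn.Linear(2 * dim, num_classes)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a / feats_b: (batch, num_levels, dim) multi-level features per modality
        global_a = self.attn_a(feats_a)
        global_b = self.attn_b(feats_b)
        logits_a = self.head_a(global_a)
        logits_b = self.head_b(global_b)
        logits_fused = self.head_fused(torch.cat([global_a, global_b], dim=-1))
        # Average modality-specific and fused predictions as the final decision.
        return (logits_a + logits_b + logits_fused) / 3


# Usage example with random features: 4 samples, 3 feature levels, 128-dim, 10 classes.
model = HALNetSketch(dim=128, num_classes=10)
out = model(torch.randn(4, 3, 128), torch.randn(4, 3, 128))  # shape: (4, 10)
```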