{"title":"A Hierarchical Cross-Modal Spatial Fusion Network for Multimodal Emotion Recognition","authors":"Ming Xu;Tuo Shi;Hao Zhang;Zeyi Liu;Xiao He","doi":"10.1109/TAI.2024.3523250","DOIUrl":null,"url":null,"abstract":"Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this article. Additionally, to enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. The HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-human-computer interaction (HCI) dataset, reaching the state-of-the-art level.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 5","pages":"1429-1438"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10820048/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recent advances in emotion recognition based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between modalities, such as video and electroencephalography (EEG) data. In this article, a feature-fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. An EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution are designed to thoroughly extract the features of each modality. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed. Additionally, a multiscale spatial pyramid pooling module is designed to enhance the network's ability to perceive emotion-related features. A self-distillation method is also introduced, which improves performance while reducing the number of network parameters. HCSFNet achieves an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset and 60.59% on the MAHNOB-HCI (human–computer interaction) dataset, reaching state-of-the-art performance.
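To make the two-branch design concrete, the following is a minimal PyTorch sketch of the structure the abstract describes: a 1-D convolutional EEG encoder, a 3-D convolutional video encoder, and a simple cross-modal fusion step. The layer sizes, the gating form of the fusion, and the classifier head are illustrative assumptions; they are not the authors' exact HCSFNet, whose hierarchical coordinated attention, spatial pyramid pooling, and self-distillation are defined in the full paper.

```python
# Illustrative sketch only: two modality-specific encoders plus a toy
# cross-modal gating fusion. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class EEGEncoder(nn.Module):
    """1-D convolutions over EEG channels x time (assumed input: [B, C, T])."""
    def __init__(self, in_channels=32, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> [B, dim, 1]
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # [B, dim]


class VideoEncoder(nn.Module):
    """3-D convolutions over video clips (assumed input: [B, 3, T, H, W])."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool over time and space
        )

    def forward(self, x):
        return self.net(x).flatten(1)  # [B, dim]


class CrossModalFusion(nn.Module):
    """Toy stand-in for cross-modal attention: each modality gates the other
    before the fused vector is classified (e.g., valence/arousal classes)."""
    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        self.eeg_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.vid_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, eeg_feat, vid_feat):
        eeg_attended = eeg_feat * self.vid_gate(vid_feat)  # video modulates EEG
        vid_attended = vid_feat * self.eeg_gate(eeg_feat)  # EEG modulates video
        return self.classifier(torch.cat([eeg_attended, vid_attended], dim=1))


if __name__ == "__main__":
    eeg = torch.randn(4, 32, 512)          # 32-channel EEG segments, 512 samples
    video = torch.randn(4, 3, 16, 64, 64)  # 16-frame RGB clips
    logits = CrossModalFusion()(EEGEncoder()(eeg), VideoEncoder()(video))
    print(logits.shape)                    # torch.Size([4, 2])
```

The gating fusion above is only a placeholder for the paper's hierarchical cross-modal coordinated attention; it illustrates where the EEG and video feature streams interact before classification.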