{"title":"面向多模态情感识别的分层跨模态空间融合网络","authors":"Ming Xu;Tuo Shi;Hao Zhang;Zeyi Liu;Xiao He","doi":"10.1109/TAI.2024.3523250","DOIUrl":null,"url":null,"abstract":"Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this article. Additionally, to enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. The HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-human-computer interaction (HCI) dataset, reaching the state-of-the-art level.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 5","pages":"1429-1438"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Hierarchical Cross-Modal Spatial Fusion Network for Multimodal Emotion Recognition\",\"authors\":\"Ming Xu;Tuo Shi;Hao Zhang;Zeyi Liu;Xiao He\",\"doi\":\"10.1109/TAI.2024.3523250\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this article. Additionally, to enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. 
The HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-human-computer interaction (HCI) dataset, reaching the state-of-the-art level.\",\"PeriodicalId\":73305,\"journal\":{\"name\":\"IEEE transactions on artificial intelligence\",\"volume\":\"6 5\",\"pages\":\"1429-1438\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on artificial intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10820048/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10820048/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Recent advances in emotion recognition based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature-fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, modality-specific features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed. Additionally, to enhance the network's ability to perceive emotion-related features, a multiscale spatial pyramid pooling module is designed. Meanwhile, a self-distillation method is introduced, which improves performance while reducing the number of network parameters. HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset and 60.59% on the MAHNOB-HCI (human-computer interaction) dataset, reaching state-of-the-art performance.
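To make the described two-branch architecture concrete, below is a minimal PyTorch sketch: a 1-D convolutional EEG encoder, a 3-D convolutional video encoder, and a simple gated cross-modal fusion that stands in for the paper's hierarchical cross-modal coordinated attention. This is not the authors' released implementation; all module names, tensor shapes, layer widths, and the gating mechanism are illustrative assumptions, and the multiscale spatial pyramid pooling and self-distillation components are omitted for brevity.

```python
# Minimal sketch (not the authors' code): a two-branch EEG/video fusion model.
# Assumed placeholder shapes: EEG as (batch, channels=32, samples=512),
# video as (batch, 3, frames=16, height=112, width=112). All hyperparameters
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class EEGBranch(nn.Module):
    """1-D convolutional encoder for raw EEG windows."""
    def __init__(self, in_channels: int = 32, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.Conv1d(64, dim, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm1d(dim), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),           # -> (B, dim, 1)
        )

    def forward(self, x):                      # x: (B, C, T)
        return self.net(x).squeeze(-1)         # -> (B, dim)


class VideoBranch(nn.Module):
    """3-D convolutional encoder for short video clips."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.Conv3d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(dim), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),           # -> (B, dim, 1, 1, 1)
        )

    def forward(self, x):                      # x: (B, 3, T, H, W)
        return self.net(x).flatten(1)          # -> (B, dim)


class CrossModalGate(nn.Module):
    """Simple gated cross-modal interaction: each modality re-weights the
    other. A stand-in for the hierarchical cross-modal coordinated attention."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate_from_eeg = nn.Linear(dim, dim)
        self.gate_from_vid = nn.Linear(dim, dim)

    def forward(self, f_eeg, f_vid):
        f_eeg = f_eeg * torch.sigmoid(self.gate_from_vid(f_vid))  # video gates EEG
        f_vid = f_vid * torch.sigmoid(self.gate_from_eeg(f_eeg))  # gated EEG gates video
        return torch.cat([f_eeg, f_vid], dim=1)                   # -> (B, 2*dim)


class FusionNet(nn.Module):
    def __init__(self, num_classes: int = 2, dim: int = 128):
        super().__init__()
        self.eeg = EEGBranch(dim=dim)
        self.video = VideoBranch(dim=dim)
        self.fusion = CrossModalGate(dim=dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, eeg, video):
        return self.head(self.fusion(self.eeg(eeg), self.video(video)))


if __name__ == "__main__":
    model = FusionNet(num_classes=2)
    eeg = torch.randn(4, 32, 512)                # (B, channels, samples)
    video = torch.randn(4, 3, 16, 112, 112)      # (B, 3, frames, H, W)
    print(model(eeg, video).shape)               # torch.Size([4, 2])
```

As a usage note, the self-distillation mentioned in the abstract could be imitated by attaching an auxiliary classifier to an intermediate layer and adding a KL-divergence term between its softened logits and those of the final head; this detail is likewise an assumption rather than the paper's exact procedure.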