Multi-modal sentiment recognition with residual gating network and emotion intensity attention
Yadi Wang, Xiaoding Guo, Xianhong Hou, Zhijun Miao, Xiaojin Yang, Jinkai Guo
Neural Networks, Volume 188, Article 107483. Published 2025-04-25. DOI: 10.1016/j.neunet.2025.107483
Multimodal emotion recognition predicts emotions from text, visual, and acoustic modalities, and the field has already produced notable results. Previous approaches fall short in two respects: processing the complementary information among modalities, and avoiding long-term dependency while selecting the most important joint modal features. In this paper, we propose MSRG, a new multimodal emotion recognition framework consisting of feature extraction (FE), emotional intensity attention (EIA), time-step level fusion (TLF), utterance level fusion (ULF), and a sentiment inference module (SIM). EIA comprises adaptive multimodal linear pooling (AMLP) and joint cross-attention fusion (JCAF). AMLP follows an adaptive multimodal fusion strategy, dynamically computing adaptive coefficients for the three modalities and then pooling them to obtain joint modal features; JCAF computes each modality's attention weights and attention features from the cross-correlation between individual and joint features. TLF performs feature alignment fusion at the time-step level and processes the fused sequences with a residual gating network (RGN); the resulting time-step level fused features pass through two fully connected layers and an activation layer to yield the time-step level emotion intensity. ULF concatenates the three modalities' utterance level representations and feeds the fused features into a fully connected layer to yield the utterance level emotion intensity. Finally, both the time-step level and utterance level emotion intensities are fed into SIM to produce the final emotion prediction. Experiments demonstrate that MSRG achieves better prediction performance than prior methods on the CMU-MOSI and CMU-MOSEI datasets.
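The abstract describes EIA only at a high level, so the following is a minimal PyTorch sketch of one plausible reading of AMLP and JCAF. The module names mirror the abstract, but the gating layer, the softmax over modalities, the scaled dot-product form of the cross-correlation, and all dimensions are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the EIA stage (AMLP + JCAF); layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMLP(nn.Module):
    """Adaptive multimodal linear pooling: dynamically weight the three
    modalities, then pool them into a joint modal feature."""
    def __init__(self, dim: int):
        super().__init__()
        # one adaptive coefficient per modality, computed from all three inputs
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, text, visual, acoustic):
        # each input: (batch, time, dim), already projected to a shared dim
        stacked = torch.stack([text, visual, acoustic], dim=2)     # (B, T, 3, D)
        coeffs = F.softmax(self.gate(stacked.flatten(2)), dim=-1)  # (B, T, 3)
        joint = (coeffs.unsqueeze(-1) * stacked).sum(dim=2)        # pooling -> (B, T, D)
        return joint

class JCAF(nn.Module):
    """Joint cross-attention fusion: attention weights for each modality come
    from the cross-correlation between its features and the joint features."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, modality, joint):
        # cross-correlation between individual and joint features -> weights
        attn = F.softmax(modality @ joint.transpose(1, 2) * self.scale, dim=-1)  # (B, T, T)
        return attn @ joint  # attention features for this modality
```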
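The fusion heads are likewise specified only in outline. Below is a similarly hedged sketch of the RGN and the two intensity heads: the gating form (sigmoid gate over a tanh transform with a skip connection) and the mean-pooling over time steps are assumptions, while the layer counts (two fully connected layers plus an activation for TLF, one fully connected layer over the concatenation for ULF) follow the abstract. SIM, which combines the two intensities into the final prediction, is not specified and is omitted here.

```python
# Hypothetical sketch of the TLF/ULF heads; gating form and pooling are assumptions.
import torch
import torch.nn as nn

class RGN(nn.Module):
    """Residual gating network: a gated transform with a skip connection,
    intended to mitigate long-term dependency in the fused sequence."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))               # element-wise gate
        return x + g * torch.tanh(self.transform(x))  # residual + gated update

class IntensityHeads(nn.Module):
    """Time-step level head (two FC layers + activation) and utterance level
    head (concatenation + one FC layer), per the abstract."""
    def __init__(self, dim: int):
        super().__init__()
        self.tlf_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.ulf_head = nn.Linear(3 * dim, 1)

    def forward(self, fused_seq, utt_text, utt_visual, utt_acoustic):
        # fused_seq: (B, T, D); utterance inputs: (B, D)
        ts_intensity = self.tlf_head(fused_seq).mean(dim=1)  # pool time steps (assumption)
        utt = torch.cat([utt_text, utt_visual, utt_acoustic], dim=-1)
        utt_intensity = self.ulf_head(utt)
        return ts_intensity, utt_intensity
```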
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.