Multi-modal sentiment recognition with residual gating network and emotion intensity attention
Yadi Wang, Xiaoding Guo, Xianhong Hou, Zhijun Miao, Xiaojin Yang, Jinkai Guo
Neural Networks, Volume 188, Article 107483. Published 2025-04-25. DOI: 10.1016/j.neunet.2025.107483
Multimodal emotion recognition predicts emotions from text, visual, and acoustic modalities, and the field has already produced notable results. Previous approaches fall short in two respects: processing the complementary information among modalities, and avoiding long-term dependency while selecting the most important joint modal features. In this paper, we propose MSRG, a new multimodal emotion recognition framework consisting of feature extraction (FE), emotional intensity attention (EIA), time-step level fusion (TLF), utterance level fusion (ULF), and a sentiment inference module (SIM). EIA comprises adaptive multimodal linear pooling (AMLP) and joint cross-attention fusion (JCAF). AMLP follows an adaptive multimodal fusion strategy, dynamically computing adaptive coefficients for the three modalities and then pooling them to obtain joint modal features; JCAF computes each modality's attention weights and attention features from the cross-correlation between individual and joint features. TLF performs feature alignment fusion at the time-step level and processes the fused sequences with a residual gating network (RGN); the resulting time-step level fused features pass through two fully connected layers and an activation layer to yield the time-step level emotion intensity. ULF concatenates the three modalities' utterance level representations and feeds the fused features into a fully connected layer to yield the utterance level emotion intensity. Finally, both the time-step level and utterance level emotion intensities are fed into SIM to produce the final emotion prediction. Experiments demonstrate that MSRG achieves better prediction performance than prior methods on the CMU-MOSI and CMU-MOSEI datasets.
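The abstract describes EIA only at a high level, so the following is a minimal PyTorch sketch of one plausible reading of AMLP and JCAF. The module names mirror the abstract, but the gating layer, the softmax over modalities, the scaled dot-product form of the cross-correlation, and all dimensions are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the EIA stage (AMLP + JCAF); layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMLP(nn.Module):
    """Adaptive multimodal linear pooling: dynamically weight the three
    modalities, then pool them into a joint modal feature."""
    def __init__(self, dim: int):
        super().__init__()
        # one adaptive coefficient per modality, computed from all three inputs
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, text, visual, acoustic):
        # each input: (batch, time, dim), already projected to a shared dim
        stacked = torch.stack([text, visual, acoustic], dim=2)     # (B, T, 3, D)
        coeffs = F.softmax(self.gate(stacked.flatten(2)), dim=-1)  # (B, T, 3)
        joint = (coeffs.unsqueeze(-1) * stacked).sum(dim=2)        # pooling -> (B, T, D)
        return joint

class JCAF(nn.Module):
    """Joint cross-attention fusion: attention weights for each modality come
    from the cross-correlation between its features and the joint features."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, modality, joint):
        # cross-correlation between individual and joint features -> weights
        attn = F.softmax(modality @ joint.transpose(1, 2) * self.scale, dim=-1)  # (B, T, T)
        return attn @ joint  # attention features for this modality
```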
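The fusion heads are likewise specified only in outline. Below is a similarly hedged sketch of the RGN and the two intensity heads: the gating form (sigmoid gate over a tanh transform with a skip connection) and the mean-pooling over time steps are assumptions, while the layer counts (two fully connected layers plus an activation for TLF, one fully connected layer over the concatenation for ULF) follow the abstract. SIM, which combines the two intensities into the final prediction, is not specified and is omitted here.

```python
# Hypothetical sketch of the TLF/ULF heads; gating form and pooling are assumptions.
import torch
import torch.nn as nn

class RGN(nn.Module):
    """Residual gating network: a gated transform with a skip connection,
    intended to mitigate long-term dependency in the fused sequence."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))               # element-wise gate
        return x + g * torch.tanh(self.transform(x))  # residual + gated update

class IntensityHeads(nn.Module):
    """Time-step level head (two FC layers + activation) and utterance level
    head (concatenation + one FC layer), per the abstract."""
    def __init__(self, dim: int):
        super().__init__()
        self.tlf_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.ulf_head = nn.Linear(3 * dim, 1)

    def forward(self, fused_seq, utt_text, utt_visual, utt_acoustic):
        # fused_seq: (B, T, D); utterance inputs: (B, D)
        ts_intensity = self.tlf_head(fused_seq).mean(dim=1)  # pool time steps (assumption)
        utt = torch.cat([utt_text, utt_visual, utt_acoustic], dim=-1)
        utt_intensity = self.ulf_head(utt)
        return ts_intensity, utt_intensity
```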
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.