Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism

Licai Sun, Zheng Lian, J. Tao, Bin Liu, Mingyue Niu
{"title":"Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism","authors":"Licai Sun, Zheng Lian, J. Tao, Bin Liu, Mingyue Niu","doi":"10.1145/3423327.3423672","DOIUrl":null,"url":null,"abstract":"Automatic perception and understanding of human emotion or sentiment has a wide range of applications and has attracted increasing attention nowadays. The Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 provides a testing bed for recognizing human emotion or sentiment from multiple modalities (audio, video, and text) in the wild scenario. In this paper, we present our solutions to the MuSe-Wild sub-challenge of MuSe 2020. The goal of this sub-challenge is to perform continuous emotion (arousal and valence) predictions on a car review database, Muse-CaR. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we utilize the Long Short-Term Memory (LSTM) recurrent neural network as well as the self-attention mechanism to model the complex temporal dependencies in the sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide the model to learn local variations and the global trend of emotion simultaneously. Finally, two fusion strategies, early fusion and late fusion, are adopted to further boost the model's performance by exploiting complementary information from different modalities. Our proposed method achieves CCC of 0.4726 and 0.5996 for arousal and valence respectively on the test set, which outperforms the baseline system with corresponding CCC of 0.2834 and 0.2431.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"144 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3423327.3423672","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 43

Abstract

Automatic perception and understanding of human emotion or sentiment has a wide range of applications and has attracted increasing attention in recent years. The Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 challenge provides a testing bed for recognizing human emotion or sentiment from multiple modalities (audio, video, and text) in in-the-wild scenarios. In this paper, we present our solutions to the MuSe-Wild sub-challenge of MuSe 2020. The goal of this sub-challenge is to make continuous emotion (arousal and valence) predictions on a car-review database, MuSe-CaR. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we use the Long Short-Term Memory (LSTM) recurrent neural network together with the self-attention mechanism to model the complex temporal dependencies in each sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide the model to learn the local variations and the global trend of emotion simultaneously. Finally, two fusion strategies, early fusion and late fusion, are adopted to further boost performance by exploiting complementary information from different modalities. Our proposed method achieves CCCs of 0.4726 and 0.5996 for arousal and valence, respectively, on the test set, outperforming the baseline system, whose corresponding CCCs are 0.2834 and 0.2431.
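The training objective named in the abstract can be made concrete. The sketch below implements 1 − CCC as a PyTorch loss; the paper's code is not reproduced here, so the function name, the choice of PyTorch, and the flattened per-batch computation are illustrative assumptions, while the formula itself is the standard Concordance Correlation Coefficient.

```python
import torch

def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC between predicted and gold emotion sequences (hypothetical helper).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2),
    so minimizing 1 - CCC rewards matching both the local variations
    (through the covariance term) and the global trend (through the
    mean-difference term), as the abstract describes.
    """
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covar = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2.0 * covar / (
        pred.var(unbiased=False) + gold.var(unbiased=False)
        + (pred_mean - gold_mean) ** 2
    )
    return 1.0 - ccc
```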
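Likewise, a minimal sketch of the kind of LSTM-plus-self-attention sequence regressor the abstract describes is given below. All layer sizes, the bidirectional choice, and the class and parameter names are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LSTMSelfAttentionRegressor(nn.Module):
    """Hypothetical per-frame regressor: a BiLSTM over frame-level features
    of one modality, followed by multi-head self-attention and a linear head
    that outputs one arousal or valence value per frame."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) frame-level features
        h, _ = self.lstm(x)                # (batch, seq_len, 2*hidden_dim)
        a, _ = self.attn(h, h, h)          # self-attention with Q = K = V = h
        return self.head(a).squeeze(-1)    # (batch, seq_len) predictions
```

Under this sketch, early fusion would concatenate the audio, video, and text features along the last dimension before feeding a single such model, while late fusion would run one model per modality and combine (e.g., average) their per-frame predictions.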