Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism

Licai Sun, Zheng Lian, J. Tao, Bin Liu, Mingyue Niu
{"title":"Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism","authors":"Licai Sun, Zheng Lian, J. Tao, Bin Liu, Mingyue Niu","doi":"10.1145/3423327.3423672","DOIUrl":null,"url":null,"abstract":"Automatic perception and understanding of human emotion or sentiment has a wide range of applications and has attracted increasing attention nowadays. The Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 provides a testing bed for recognizing human emotion or sentiment from multiple modalities (audio, video, and text) in the wild scenario. In this paper, we present our solutions to the MuSe-Wild sub-challenge of MuSe 2020. The goal of this sub-challenge is to perform continuous emotion (arousal and valence) predictions on a car review database, Muse-CaR. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we utilize the Long Short-Term Memory (LSTM) recurrent neural network as well as the self-attention mechanism to model the complex temporal dependencies in the sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide the model to learn local variations and the global trend of emotion simultaneously. Finally, two fusion strategies, early fusion and late fusion, are adopted to further boost the model's performance by exploiting complementary information from different modalities. Our proposed method achieves CCC of 0.4726 and 0.5996 for arousal and valence respectively on the test set, which outperforms the baseline system with corresponding CCC of 0.2834 and 0.2431.","PeriodicalId":246071,"journal":{"name":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","volume":"144 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3423327.3423672","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 43

Abstract

Automatic perception and understanding of human emotion or sentiment has a wide range of applications and has attracted increasing attention in recent years. The Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 challenge provides a testing bed for recognizing human emotion or sentiment from multiple modalities (audio, video, and text) in in-the-wild scenarios. In this paper, we present our solutions to the MuSe-Wild sub-challenge of MuSe 2020. The goal of this sub-challenge is to make continuous emotion (arousal and valence) predictions on a car-review database, MuSe-CaR. To this end, we first extract both handcrafted features and deep representations from multiple modalities. Then, we use the Long Short-Term Memory (LSTM) recurrent neural network together with the self-attention mechanism to model the complex temporal dependencies in each sequence. The Concordance Correlation Coefficient (CCC) loss is employed to guide the model to learn the local variations and the global trend of emotion simultaneously. Finally, two fusion strategies, early fusion and late fusion, are adopted to further boost performance by exploiting complementary information from different modalities. Our proposed method achieves CCCs of 0.4726 and 0.5996 for arousal and valence, respectively, on the test set, outperforming the baseline system, whose corresponding CCCs are 0.2834 and 0.2431.
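The training objective named in the abstract can be made concrete. The sketch below implements 1 − CCC as a PyTorch loss; the paper's code is not reproduced here, so the function name, the choice of PyTorch, and the flattened per-batch computation are illustrative assumptions, while the formula itself is the standard Concordance Correlation Coefficient.

```python
import torch

def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - CCC between predicted and gold emotion sequences (hypothetical helper).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2),
    so minimizing 1 - CCC rewards matching both the local variations
    (through the covariance term) and the global trend (through the
    mean-difference term), as the abstract describes.
    """
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covar = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2.0 * covar / (
        pred.var(unbiased=False) + gold.var(unbiased=False)
        + (pred_mean - gold_mean) ** 2
    )
    return 1.0 - ccc
```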
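Likewise, a minimal sketch of the kind of LSTM-plus-self-attention sequence regressor the abstract describes is given below. All layer sizes, the bidirectional choice, and the class and parameter names are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LSTMSelfAttentionRegressor(nn.Module):
    """Hypothetical per-frame regressor: a BiLSTM over frame-level features
    of one modality, followed by multi-head self-attention and a linear head
    that outputs one arousal or valence value per frame."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) frame-level features
        h, _ = self.lstm(x)                # (batch, seq_len, 2*hidden_dim)
        a, _ = self.attn(h, h, h)          # self-attention with Q = K = V = h
        return self.head(a).squeeze(-1)    # (batch, seq_len) predictions
```

Under this sketch, early fusion would concatenate the audio, video, and text features along the last dimension before feeding a single such model, while late fusion would run one model per modality and combine (e.g., average) their per-frame predictions.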