Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model

Licai Sun, Mingyu Xu, Zheng Lian, B. Liu, J. Tao, Meng Wang, Yuan Cheng
{"title":"Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model","authors":"Licai Sun, Mingyu Xu, Zheng Lian, B. Liu, J. Tao, Meng Wang, Yuan Cheng","doi":"10.1145/3475957.3484456","DOIUrl":null,"url":null,"abstract":"With the proliferation of user-generated videos in online websites, it becomes particularly important to achieve automatic perception and understanding of human emotion/sentiment from these videos. In this paper, we present our solutions to the MuSe-Wilder and MuSe-Sent sub-challenges in MuSe 2021 Multimodal Sentiment Analysis Challenge. MuSe-Wilder focuses on continuous emotion (i.e., arousal and valence) recognition while the task of MuSe-Sent concentrates on discrete sentiment classification. To this end, we first extract a variety of features from three common modalities (i.e., audio, visual, and text), including both low-level handcrafted features and high-level deep representations from supervised/unsupervised pre-trained models. Then, the long short-term memory recurrent neural network, as well as the self-attention mechanism is employed to model the complex temporal dependencies in the feature sequence. The concordance correlation coefficient (CCC) loss and F1-loss are used to guide continuous regression and discrete classification, respectively. To further boost the model's performance, we adopt late fusion to exploit complementary information from different modalities. Our proposed method achieves CCCs of 0.4117 and 0.6649 for arousal and valence respectively on the test set of MuSe-Wilder, which outperforms the baseline system (i.e., 0.3386 and 0.5974) by a large margin. For MuSe-Sent, F1-scores of 0.3614 and 0.4451 for arousal and valence are obtained, which also outperforms the baseline system significantly (i.e., 0.3512 and 0.3291). With these promising results, we ranked top3 in both sub-challenges.","PeriodicalId":313996,"journal":{"name":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","volume":"101 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3475957.3484456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

With the proliferation of user-generated videos on online websites, it has become particularly important to automatically perceive and understand human emotion/sentiment from these videos. In this paper, we present our solutions to the MuSe-Wilder and MuSe-Sent sub-challenges of the MuSe 2021 Multimodal Sentiment Analysis Challenge. MuSe-Wilder focuses on continuous emotion (i.e., arousal and valence) recognition, while MuSe-Sent concentrates on discrete sentiment classification. To this end, we first extract a variety of features from three common modalities (audio, visual, and text), including both low-level handcrafted features and high-level deep representations from supervised/unsupervised pre-trained models. Then, a long short-term memory (LSTM) recurrent neural network, together with a self-attention mechanism, is employed to model the complex temporal dependencies in the feature sequence. The concordance correlation coefficient (CCC) loss and the F1 loss are used to guide continuous regression and discrete classification, respectively. To further boost performance, we adopt late fusion to exploit complementary information from different modalities. Our proposed method achieves CCCs of 0.4117 (arousal) and 0.6649 (valence) on the MuSe-Wilder test set, outperforming the baseline system (0.3386 and 0.5974) by a large margin. For MuSe-Sent, it obtains F1-scores of 0.3614 (arousal) and 0.4451 (valence), again clearly surpassing the baseline (0.3512 and 0.3291). With these promising results, we ranked in the top 3 in both sub-challenges.
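For readers unfamiliar with the two core ingredients named in the abstract, the attention-enhanced recurrent model and the CCC objective, the following is a minimal, hypothetical PyTorch sketch. It is not the authors' released code: the class and function names, layer sizes, and tensor shapes are assumptions made only to illustrate a BiLSTM followed by self-attention over time and a 1 − CCC regression loss.

```python
import torch
import torch.nn as nn

class AttentionEnhancedLSTM(nn.Module):
    """Hypothetical sketch: per-frame features -> BiLSTM -> self-attention -> frame-wise regression."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)              # (batch, time, 2 * hidden_dim)
        a, _ = self.attn(h, h, h)        # self-attention over the time axis
        return self.head(a).squeeze(-1)  # (batch, time) continuous predictions

def ccc_loss(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """1 - concordance correlation coefficient between two 1-D value sequences."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    covar = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2.0 * covar / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)
    return 1.0 - ccc

# Toy usage with random data (all shapes are assumptions):
model = AttentionEnhancedLSTM(feat_dim=88)   # e.g., an eGeMAPS-sized audio feature vector
feats = torch.randn(2, 500, 88)              # (batch, frames, feature dim)
labels = torch.randn(2, 500)                 # frame-level arousal or valence annotations
loss = ccc_loss(model(feats).reshape(-1), labels.reshape(-1))
loss.backward()
```

The CCC loss penalizes both low correlation and mean/scale mismatch between predictions and annotations, which is why it is the standard objective for continuous arousal/valence regression; the F1 loss used for MuSe-Sent and the late-fusion step are not sketched here.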