Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks
Lang He, D. Jiang, Le Yang, Ercheng Pei, Peng Wu, H. Sahli
Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, 2015. DOI: 10.1145/2808196.2811641
Citations: 156
Abstract
This paper presents our system design for the Audio-Visual Emotion Challenge (AV+EC 2015). Besides the baseline features, we extract from audio the functionals on low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electrocardiogram (ECG) features and 22 electrodermal activity (EDA) features from various analysis domains. The extracted features, together with the AV+EC 2015 baseline features of audio, ECG, or EDA, are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), is used as the objective function. In addition, temporal offsets between the features and the arousal/valence labels are taken into account in both feature selection and the modeling of the affective dimensions. For the fusion of multimodal features, we propose a Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework, in which the initial predictions from the single modalities, each produced by a DBLSTM-RNN, are first smoothed with a Gaussian filter and then fed into a second-level DBLSTM-RNN for the final prediction of the affective state. Experimental results show that the proposed features and the DBLSTM-RNN based fusion framework yield very promising results: on the development set, the obtained CCC reaches 0.824 for arousal and 0.688 for valence, and on the test set, the CCC is 0.747 for arousal and 0.609 for valence.
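For reference, the feature-selection objective mentioned above is the standard concordance correlation coefficient, which the abstract does not restate. For a prediction sequence x and a label sequence y it is

$$\rho_c = \frac{2\,\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},$$

where $\rho$ is the Pearson correlation coefficient and $\mu_x, \mu_y, \sigma_x^2, \sigma_y^2$ are the means and variances of the two sequences. Unlike the Pearson CC, the CCC penalizes differences in scale and location between predictions and labels, which is what motivates its use as the selection criterion here.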
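The following is a minimal sketch of how CCC-based scoring with label-offset compensation could look in practice. It is illustrative code, not the authors' implementation; the function names, the frame-level offset scan, and the maximum offset of 200 frames are assumptions.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D sequences."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()           # population variances
    cov = ((x - mx) * (y - my)).mean()  # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

def best_offset(feature, label, max_offset=200):
    """Scan forward shifts of the labels (annotation delay) and return
    the offset in frames that maximizes CCC, plus that CCC value."""
    scores = []
    for d in range(max_offset + 1):
        if d == 0:
            scores.append(ccc(feature, label))
        else:
            # feature at frame t is aligned with label at frame t + d
            scores.append(ccc(feature[:-d], label[d:]))
    return int(np.argmax(scores)), max(scores)
```

A feature track would then be scored by its best achievable CCC against the shifted arousal/valence trace, and the top-ranked features retained for the concatenated feature set.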
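Likewise, a condensed sketch of the two-level fusion framework described above, using Keras-style layers and SciPy's Gaussian filter as stand-ins (the original work predates these exact tools; the hidden size, smoothing width, and MSE training loss are placeholders rather than values from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

def build_dblstm(n_features, hidden=64):
    """Two stacked bidirectional LSTM layers with a linear output,
    predicting one affective dimension per frame."""
    model = Sequential([
        Bidirectional(LSTM(hidden, return_sequences=True),
                      input_shape=(None, n_features)),
        Bidirectional(LSTM(hidden, return_sequences=True)),
        Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Stage 1: one DBLSTM per modality (audio, video, ECG, EDA), each giving
# per-frame predictions of shape (n_sequences, n_frames, 1).

def fuse_inputs(per_modality_preds, sigma=10):
    """Gaussian-smooth each single-modality prediction along time, then
    stack the smoothed traces as inputs to the second-level DBLSTM."""
    smoothed = [gaussian_filter1d(p, sigma=sigma, axis=1)
                for p in per_modality_preds]
    return np.concatenate(smoothed, axis=-1)  # (n_seq, n_frames, n_modalities)

# fusion_model = build_dblstm(n_features=len(per_modality_preds))
# fusion_model.fit(fuse_inputs(per_modality_preds), labels, epochs=20)
```

The design point worth noting is that the second-level DBLSTM-RNN sees only the smoothed single-modality prediction traces, so fusion is learned over time series of predictions rather than over the raw concatenated features.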