Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks

Lang He, D. Jiang, Le Yang, Ercheng Pei, Peng Wu, H. Sahli
{"title":"Multimodal Affective Dimension Prediction Using Deep Bidirectional Long Short-Term Memory Recurrent Neural Networks","authors":"Lang He, D. Jiang, Le Yang, Ercheng Pei, Peng Wu, H. Sahli","doi":"10.1145/2808196.2811641","DOIUrl":null,"url":null,"abstract":"This paper presents our system design for the Audio-Visual Emotion Challenge ($AV^{+}EC$ 2015). Besides the baseline features, we extract from audio the functionals on low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electro-cardiogram (ECG) features and 22 electro-dermal activity (EDA) features from various analysis domains. The extracted features along with the $AV^{+}EC$ 2015 baseline features of audio, ECG or EDA are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), has been used as objective function. In addition, offsets between the features and the arousal/valence labels are considered in both feature selection and modeling of the affective dimensions. For the fusion of multimodal features, we propose a Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework, in which the initial predictions from the single modalities via the DBLSTM-RNNs are firstly smoothed with Gaussian smoothing, then input into a second layer of DBLSTM-RNN for the final prediction of affective state. Experimental results show that our proposed features and the DBLSTM-RNN based fusion framework obtain very promising results. On the development set, the obtained CCC is up to 0.824 for arousal and 0.688 for valence, and on the test set, the CCC is 0.747 for arousal and 0.609 for valence.","PeriodicalId":123597,"journal":{"name":"Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"156","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2808196.2811641","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 156

Abstract

This paper presents our system design for the Audio-Visual Emotion Challenge (AV+EC 2015). Besides the baseline features, we extract from audio the functionals of low-level descriptors (LLDs) obtained via the YAAFE toolbox, and from video the Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) features. From the physiological signals, we extract 52 electrocardiogram (ECG) features and 22 electrodermal activity (EDA) features from various analysis domains. The extracted features, along with the AV+EC 2015 baseline features of audio, ECG, or EDA, are concatenated for a further feature selection step, in which the concordance correlation coefficient (CCC), instead of the usual Pearson correlation coefficient (CC), is used as the objective function. In addition, offsets between the features and the arousal/valence labels are considered in both feature selection and modeling of the affective dimensions. For the fusion of multimodal features, we propose a Deep Bidirectional Long Short-Term Memory Recurrent Neural Network (DBLSTM-RNN) based multimodal affect prediction framework, in which the initial predictions from the single modalities via the DBLSTM-RNNs are first smoothed with Gaussian smoothing and then fed into a second-layer DBLSTM-RNN for the final prediction of the affective state. Experimental results show that our proposed features and the DBLSTM-RNN based fusion framework obtain very promising results: on the development set, the CCC reaches 0.824 for arousal and 0.688 for valence, and on the test set, 0.747 for arousal and 0.609 for valence.
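For reference, the CCC used as the selection objective is the standard (Lin) concordance correlation coefficient between a prediction sequence x and a label sequence y, shown below; unlike Pearson's correlation, it also penalizes differences in mean and scale, so a prediction that tracks the labels with the wrong bias or amplitude scores lower. This is the textbook definition, not a formula reproduced from the paper itself.

```latex
% Lin's concordance correlation coefficient between predictions x and labels y:
\rho_c = \frac{2\,\rho\,\sigma_x \sigma_y}
              {\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
% where \mu_x, \mu_y are the means, \sigma_x^2, \sigma_y^2 the variances,
% and \rho the Pearson correlation coefficient of x and y.
```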
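The following is a minimal sketch of the two-stage fusion idea described in the abstract, written against the Keras API. It is not the authors' code: the layer sizes, the Gaussian sigma, and the MSE training loss are illustrative assumptions.

```python
# Sketch of the DBLSTM-RNN fusion framework (illustrative, not the authors'
# code). Stage 1: one DBLSTM per modality produces a frame-level affect trace.
# Stage 2: the traces are Gaussian-smoothed, concatenated, and fed to a
# second DBLSTM for the final arousal/valence prediction.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

def build_dblstm(n_features, units=(64, 32)):
    """Stacked bidirectional LSTMs with a per-frame linear output.

    Layer sizes are assumptions; the abstract does not specify them.
    """
    model = Sequential([
        Bidirectional(LSTM(units[0], return_sequences=True),
                      input_shape=(None, n_features)),
        Bidirectional(LSTM(units[1], return_sequences=True)),
        TimeDistributed(Dense(1)),  # one affect value per frame
    ])
    model.compile(optimizer="adam", loss="mse")  # assumed training loss
    return model

def fuse_stage1_outputs(per_modality_preds, sigma=10.0):
    """Gaussian-smooth each modality's (n_seq, n_frames, 1) prediction stream
    along time, then stack the streams as input features for the second-stage
    DBLSTM. The sigma value is an assumption."""
    smoothed = [gaussian_filter1d(p, sigma=sigma, axis=1)
                for p in per_modality_preds]
    return np.concatenate(smoothed, axis=-1)  # (n_seq, n_frames, n_modalities)
```

Under these assumptions, the fusion model would be built with build_dblstm(n_features=<number of modalities>) and trained on the stacked, smoothed traces against the arousal or valence labels.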