K. Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, W. Campbell, Charlie K. Dagli, Thomas S. Huang
{"title":"Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction","authors":"K. Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, W. Campbell, Charlie K. Dagli, Thomas S. Huang","doi":"10.1145/2988257.2988264","DOIUrl":null,"url":null,"abstract":"The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity for investigating multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state space estimation approach is applied for score fusion that demonstrates the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set with an achieved Concordant Correlation Coefficient (CCC) for arousal of 0.770 vs 0.702 (baseline) and for valence of 0.687 vs 0.638. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.","PeriodicalId":432793,"journal":{"name":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"90","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2988257.2988264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 90
Abstract
The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications, including biomedical diagnostics, multimedia retrieval, and human-computer interfaces. The Audio/Visual Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity to investigate multimodal solutions that combine audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It makes a number of technical contributions, among them the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state-space estimation approach is applied for score fusion, demonstrating the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set, achieving a Concordance Correlation Coefficient (CCC) of 0.770 vs. 0.702 (baseline) for arousal and 0.687 vs. 0.638 for valence. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.
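For context on the reported scores: the Concordance Correlation Coefficient combines Pearson correlation with penalties for mean and variance mismatch between a predicted trace and the gold-standard trace, which is why it is preferred over plain correlation for continuous arousal/valence prediction. Below is a minimal, illustrative sketch of how such a score can be computed with NumPy; the function name and the toy traces are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def concordance_correlation_coefficient(predictions, labels):
    """Lin's Concordance Correlation Coefficient between two 1-D series.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    x = np.asarray(predictions, dtype=float)
    y = np.asarray(labels, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()  # population variances
    covariance = ((x - mean_x) * (y - mean_y)).mean()
    return 2.0 * covariance / (var_x + var_y + (mean_x - mean_y) ** 2)

# Toy usage: score a noisy predicted arousal trace against a synthetic gold trace.
rng = np.random.default_rng(0)
gold = np.sin(np.linspace(0.0, 6.0, 200))
pred = gold + 0.2 * rng.standard_normal(200)
print(round(concordance_correlation_coefficient(pred, gold), 3))
```

A perfect prediction yields a CCC of 1.0; a prediction that correlates well but is offset or scaled relative to the gold standard is penalized, which matches how the arousal (0.770 vs. 0.702) and valence (0.687 vs. 0.638) results above are compared against the baseline.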