Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction

K. Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, W. Campbell, Charlie K. Dagli, Thomas S. Huang
{"title":"Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction","authors":"K. Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, W. Campbell, Charlie K. Dagli, Thomas S. Huang","doi":"10.1145/2988257.2988264","DOIUrl":null,"url":null,"abstract":"The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications including biomedical diagnostics, multimedia retrieval, and human computer interfaces. The Audio Video Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity for investigating multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including the development of novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include modeling arousal in audio with minimal prosodic-based descriptors. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state space estimation approach is applied for score fusion that demonstrates the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set with an achieved Concordant Correlation Coefficient (CCC) for arousal of 0.770 vs 0.702 (baseline) and for valence of 0.687 vs 0.638. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.","PeriodicalId":432793,"journal":{"name":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"90","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2988257.2988264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 90

Abstract

The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications, including biomedical diagnostics, multimedia retrieval, and human-computer interfaces. The Audio/Visual Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity to investigate multimodal solutions that include audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It includes a number of technical contributions, including novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include minimal prosodic descriptors for modeling arousal in audio. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state-space estimation approach is applied for score fusion, demonstrating the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test evaluation set, achieving a Concordance Correlation Coefficient (CCC) of 0.770 vs. 0.702 (baseline) for arousal and 0.687 vs. 0.638 for valence. Future work will focus on exploiting the time-varying nature of individual channels in the multi-modal framework.
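AVEC 2016 scores continuous emotion prediction with the Concordance Correlation Coefficient, which, unlike Pearson correlation, penalizes both decorrelation and any systematic bias or scale mismatch between the predicted and gold-standard traces. The sketch below is a straightforward NumPy implementation of the metric; the function name and the synthetic arousal trace are illustrative and not taken from the paper.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Concordance Correlation Coefficient (CCC) between two time series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    # CCC = 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)
    return 2.0 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Example: CCC of a noisy, slightly biased prediction of an arousal trace
t = np.linspace(0, 10, 500)
arousal_gold = np.sin(t)
arousal_pred = 0.9 * np.sin(t) + 0.05 + 0.1 * np.random.randn(t.size)
print(concordance_correlation_coefficient(arousal_gold, arousal_pred))
```

Because the denominator includes the squared mean difference, a prediction that tracks the shape of the trace but sits at the wrong level still receives a reduced score, which is why CCC is preferred over plain correlation for this task.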
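The abstract describes a state-space estimation approach to score fusion without spelling out the model. One common way to exploit the time-series nature of arousal and valence is a scalar Kalman filter that treats each modality's per-frame score as a noisy observation of a slowly varying latent state. The sketch below illustrates that idea under assumed random-walk dynamics and hand-picked noise variances; it is a minimal illustration, not the authors' exact formulation.

```python
import numpy as np

def kalman_fuse(modality_scores, process_var=1e-3, obs_var=1e-1):
    """
    Fuse per-modality emotion scores over time with a scalar Kalman filter.

    modality_scores: array of shape (T, M), one score per frame per modality,
                     each treated as a noisy observation of the latent state.
    Returns the fused (filtered) trajectory of length T.
    """
    T, M = modality_scores.shape
    x, p = 0.0, 1.0                      # state estimate and its variance
    fused = np.empty(T)
    for t in range(T):
        # Predict: random-walk dynamics on the latent arousal/valence state
        p += process_var
        # Update: sequentially assimilate each modality's observation
        for m in range(M):
            k = p / (p + obs_var)        # Kalman gain
            x += k * (modality_scores[t, m] - x)
            p *= (1.0 - k)
        fused[t] = x
    return fused

# Example: fuse three noisy modality predictions of one emotion trace
T = 200
scores = np.column_stack([np.sin(np.linspace(0, 6, T)) + 0.2 * np.random.randn(T)
                          for _ in range(3)])
print(kalman_fuse(scores)[:5])
```

The filter's smoothing behavior is what ties the fused output to the temporal structure of the annotations, which is the property the abstract highlights as important for arousal and valence prediction.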