Speech-based Continuous Emotion Prediction by Learning Perception Responses related to Salient Events: A Study based on Vocal Affect Bursts and Cross-Cultural Affect in AVEC 2018

Kalani Wataraka Gamage, T. Dang, V. Sethu, J. Epps, E. Ambikairajah
{"title":"Speech-based Continuous Emotion Prediction by Learning Perception Responses related to Salient Events: A Study based on Vocal Affect Bursts and Cross-Cultural Affect in AVEC 2018","authors":"Kalani Wataraka Gamage, T. Dang, V. Sethu, J. Epps, E. Ambikairajah","doi":"10.1145/3266302.3266314","DOIUrl":null,"url":null,"abstract":"This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises the perceived emotion estimation as time-invariant responses to salient events. Then arousal and valence variation over time is modelded as the ouput of a parallel array of time-invariant filters where each filter represents a salient event in this context, and the impulse response of the filter represents the learned perception emotion response. The proposed model is evaluted by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. The proposed model is validated based on the development dataset of AVEC 2018 challenge development dataset and achieves the highest accuracy of valence prediction among single modal methods based on speech or speech-transcript. We tested this model on cross-cultural settings provided by AVEC 2018 challenge test set, and the model performs reasonably well for an unseen culture as well and outperform speech-based baselines. Further we explore inclusion of interlocutor related cues to the proposed model and decision level fusion with existing features. Since the proposed model was evaluated solely based on laughter and slight laughter affect bursts which were nominated as salient by proposed saliency constrains of the model, the results presented highlight the significance of aforementioned gestures in human emotion expression and perception","PeriodicalId":123523,"journal":{"name":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","volume":"396 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3266302.3266314","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

This paper presents a novel framework for speech-based continuous emotion prediction. The proposed model characterises perceived emotion estimation as a set of time-invariant responses to salient events. Arousal and valence variation over time is then modelled as the output of a parallel array of time-invariant filters, where each filter represents a salient event and its impulse response represents the learned perceptual emotion response. The proposed model is evaluated by considering vocal affect bursts/non-verbal vocal gestures as salient event candidates. It is validated on the development set of the AVEC 2018 challenge and achieves the highest valence prediction accuracy among single-modality methods based on speech or speech transcripts. We also tested the model in the cross-cultural setting provided by the AVEC 2018 challenge test set, where it performs reasonably well for an unseen culture and outperforms the speech-based baselines. Further, we explore the inclusion of interlocutor-related cues in the proposed model and decision-level fusion with existing features. Since the proposed model was evaluated solely on laughter and slight-laughter affect bursts, which were nominated as salient by the model's proposed saliency constraints, the results highlight the significance of these gestures in human emotion expression and perception.
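The filter-bank formulation in the abstract can be illustrated with a short sketch: each salient-event type (e.g. laughter) is represented by a binary activation signal, which is convolved with a learned impulse response, and the per-event outputs are summed to form the predicted arousal/valence trace, i.e. y(t) = Σ_k (x_k * h_k)(t). The code below is a minimal illustration of that idea, not the authors' implementation; the event positions, filter length, and the simple least-squares fit are assumptions made for the example.

```python
# Minimal sketch (assumed, not the authors' code): a perceived-emotion trace
# modelled as the summed output of a parallel bank of time-invariant filters,
# one filter per salient-event type (e.g. laughter, slight laughter).
import numpy as np

def event_signal(event_frames, n_frames):
    """Binary activation signal: 1 at the frames where the salient event occurs."""
    x = np.zeros(n_frames)
    x[np.asarray(event_frames, dtype=int)] = 1.0
    return x

def predict_trace(event_signals, impulse_responses):
    """y(t) = sum_k (x_k * h_k)(t): each event signal convolved with its learned
    perception response, then summed and truncated to the clip length."""
    n_frames = len(event_signals[0])
    y = np.zeros(n_frames)
    for x_k, h_k in zip(event_signals, impulse_responses):
        y += np.convolve(x_k, h_k)[:n_frames]
    return y

def fit_impulse_responses(event_signals, target, filter_len):
    """Least-squares estimate of each h_k from a gold-standard annotation trace.
    The design-matrix columns are delayed copies of the event signals."""
    n_frames = len(target)
    cols = []
    for x in event_signals:
        for d in range(filter_len):
            cols.append(np.roll(x, d) * (np.arange(n_frames) >= d))
    X = np.stack(cols, axis=1)
    h_flat, *_ = np.linalg.lstsq(X, target, rcond=None)
    return h_flat.reshape(len(event_signals), filter_len)

# Toy usage: two event types over a 200-frame clip with 40-frame responses.
rng = np.random.default_rng(0)
laughter = event_signal([20, 120], 200)
slight_laughter = event_signal([70, 170], 200)
true_h = np.stack([np.hanning(40) * 0.8, np.hanning(40) * 0.3])
valence = predict_trace([laughter, slight_laughter], true_h)
valence += 0.01 * rng.standard_normal(200)   # simulated annotation noise
h_est = fit_impulse_responses([laughter, slight_laughter], valence, filter_len=40)
print("max |h_est - h_true| =", np.abs(h_est - true_h).max())
```

Under these assumptions the learned impulse responses can be read directly as the perceived emotion response evoked by each event type, which is the interpretability the abstract points to when it singles out laughter and slight laughter.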