Title: Confidence Estimation for Speech Emotion Recognition Based on the Relationship Between Emotion Categories and Primitives
Authors: Y. Li, C. Papayiannis, Viktor Rozgic, Elizabeth Shriberg, Chao Wang
Venue: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publication date: 2022-05-23
DOI: 10.1109/ICASSP43922.2022.9746930
Citations: 2
Abstract
Confidence estimation for Speech Emotion Recognition (SER) is instrumental in improving the reliability of downstream applications. In this work we propose (1) a novel confidence metric for SER based on the relationship between emotion primitives (arousal, valence, and dominance; AVD) and emotion categories (ECs), (2) EmoConfidNet, a DNN trained alongside the EC recognizer to predict the proposed confidence metric, and (3) a data filtering technique used to enhance the training of EmoConfidNet and the EC recognizer. For each training sample, we calculate the distances from its AVD annotation vector to the centroid of each EC in the AVD space, and define EC confidences as functions of these distances. EmoConfidNet is trained to predict this confidence from the same acoustic representations used to train the EC recognizer. EmoConfidNet outperforms state-of-the-art confidence estimation methods on the MSP-Podcast and IEMOCAP datasets. For a fixed EC recognizer, rejecting the same number of low-confidence predictions with EmoConfidNet yields higher F1 and unweighted average recall (UAR) than rejecting with other methods.
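The centroid-based confidence idea can be illustrated with a minimal sketch: per-EC centroids are computed from the training AVD annotations, and a sample's distances to those centroids are mapped to a confidence distribution over ECs. The softmax-of-negative-distances mapping, the temperature parameter, and the toy data layout below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a distance-to-centroid confidence metric in AVD space (assumed form).
import numpy as np

def ec_centroids(avd: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Mean AVD vector of the training samples belonging to each EC."""
    return np.stack([avd[labels == c].mean(axis=0) for c in range(num_classes)])

def ec_confidences(avd_vec: np.ndarray, centroids: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Map distances from one AVD annotation vector to the EC centroids into confidences.

    Closer centroids receive higher confidence; the softmax temperature is a
    hypothetical knob, not a value taken from the paper.
    """
    dists = np.linalg.norm(centroids - avd_vec, axis=1)  # Euclidean distance to each EC centroid
    scores = -dists / temperature                        # smaller distance -> larger score
    exp = np.exp(scores - scores.max())                  # numerically stable softmax
    return exp / exp.sum()

# Toy usage: 4 ECs, AVD annotations on a 1-5 scale (values are synthetic).
rng = np.random.default_rng(0)
train_avd = rng.uniform(1.0, 5.0, size=(200, 3))
train_labels = rng.integers(0, 4, size=200)
centroids = ec_centroids(train_avd, train_labels, num_classes=4)
conf = ec_confidences(np.array([3.8, 4.2, 3.0]), centroids)
print(conf, conf.argmax())
```

In the approach described in the abstract, such per-sample confidences would serve as training targets for EmoConfidNet, which learns to predict them from the same acoustic representations used by the EC recognizer, so that no AVD annotations are needed at inference time.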