{"title":"Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance","authors":"Huang-Cheng Chou, Haibin Wu, Chi-Chun Lee","doi":"arxiv-2409.10762","DOIUrl":null,"url":null,"abstract":"Speech Emotion Recognition (SER) systems rely on speech input and emotional\nlabels annotated by humans. However, various emotion databases collect\nperceptional evaluations in different ways. For instance, the IEMOCAP dataset\nuses video clips with sounds for annotators to provide their emotional\nperceptions. However, the most significant English emotion dataset, the\nMSP-PODCAST, only provides speech for raters to choose the emotional ratings.\nNevertheless, using speech as input is the standard approach to training SER\nsystems. Therefore, the open question is the emotional labels elicited by which\nscenarios are the most effective for training SER systems. We comprehensively\ncompare the effectiveness of SER systems trained with labels elicited by\ndifferent modality stimuli and evaluate the SER systems on various testing\nconditions. Also, we introduce an all-inclusive label that combines all labels\nelicited by various modalities. We show that using labels elicited by\nvoice-only stimuli for training yields better performance on the test set,\nwhereas labels elicited by voice-only stimuli.","PeriodicalId":501034,"journal":{"name":"arXiv - EE - Signal Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10762","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, different emotion databases collect perceptual evaluations in different ways. For instance, the IEMOCAP dataset presents video clips with sound, so annotators report their emotional perceptions from audio-visual stimuli, whereas the most significant English emotion dataset, the MSP-PODCAST, provides only speech for raters to choose the emotional ratings. Nevertheless, using speech as input is the standard approach to training SER systems. The open question is therefore which stimulus scenarios elicit the emotional labels that are most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by different modality stimuli and evaluate the SER systems under various testing conditions. We also introduce an all-inclusive label that combines all labels elicited by the various modalities. We show that using labels elicited by voice-only stimuli for training yields better performance on the test set.
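The abstract does not specify how the all-inclusive label is constructed, so the following is a minimal Python sketch of one plausible aggregation: per-rater categorical votes are grouped by the stimulus modality shown to the rater, each modality gets a majority-vote label, and the all-inclusive label pools every vote across modalities into a soft (distributional) target. The modality names, vote data, and the distributional-label choice are illustrative assumptions, not details taken from the paper.

    from collections import Counter

    # Hypothetical per-utterance annotations, keyed by the stimulus modality
    # the rater was shown (names and votes are illustrative, not from the paper).
    annotations = {
        "audio_only":  ["happy", "neutral", "happy"],
        "video_only":  ["neutral", "neutral"],
        "audio_video": ["happy", "happy", "sad"],
    }

    def modality_label(votes):
        """Majority-vote label from raters who saw a single stimulus modality."""
        return Counter(votes).most_common(1)[0][0]

    def all_inclusive_label(per_modality):
        """Pool every rating across modalities into one soft (distributional) label."""
        pooled = Counter(v for votes in per_modality.values() for v in votes)
        total = sum(pooled.values())
        return {emotion: count / total for emotion, count in pooled.items()}

    print({m: modality_label(v) for m, v in annotations.items()})
    # {'audio_only': 'happy', 'video_only': 'neutral', 'audio_video': 'happy'}
    print(all_inclusive_label(annotations))
    # {'happy': 0.5, 'neutral': 0.375, 'sad': 0.125}

Under these assumptions, a SER model trained on the all-inclusive label would use the pooled distribution as a soft target (e.g., with a cross-entropy loss over the emotion classes), while the per-modality labels support the paper's comparison of training conditions.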