Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh
{"title":"Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features","authors":"Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh","doi":"arxiv-2409.09511","DOIUrl":null,"url":null,"abstract":"Pre-trained deep learning embeddings have consistently shown superior\nperformance over handcrafted acoustic features in speech emotion recognition\n(SER). However, unlike acoustic features with clear physical meaning, these\nembeddings lack clear interpretability. Explaining these embeddings is crucial\nfor building trust in healthcare and security applications and advancing the\nscientific understanding of the acoustic information that is encoded in them.\nThis paper proposes a modified probing approach to explain deep learning\nembeddings in the SER space. We predict interpretable acoustic features (e.g.,\nf0, loudness) from (i) the complete set of embeddings and (ii) a subset of the\nembedding dimensions identified as most important for predicting each emotion.\nIf the subset of the most important dimensions better predicts a given emotion\nthan all dimensions and also predicts specific acoustic features more\naccurately, we infer those acoustic features are important for the embedding\nmodel for the given task. We conducted experiments using the WavLM embeddings\nand eGeMAPS acoustic features as audio representations, applying our method to\nthe RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we\ndemonstrate that Energy, Frequency, Spectral, and Temporal categories of\nacoustic features provide diminishing information to SER in that order,\ndemonstrating the utility of the probing classifier method to relate embeddings\nto interpretable acoustic features.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09511","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Pre-trained deep learning embeddings have consistently outperformed handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features, which have clear physical meaning, these embeddings lack interpretability. Explaining them is crucial for building trust in healthcare and security applications and for advancing scientific understanding of the acoustic information they encode. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embedding dimensions and (ii) the subset of dimensions identified as most important for predicting each emotion. If the subset of most important dimensions predicts a given emotion better than all dimensions do, and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for that task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. This evaluation shows that the Energy, Frequency, Spectral, and Temporal categories of acoustic features contribute progressively less information to SER, in that order, and demonstrates the utility of the probing classifier method for relating embeddings to interpretable acoustic features.
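The abstract describes the probing comparison only at a high level. The sketch below illustrates one plausible way to implement it; it is a minimal sketch, not the authors' code. The choices of mean-pooled WavLM embeddings as input, one-vs-rest logistic regression coefficients as the dimension-importance measure, a ridge-regression probe, and k = 100 selected dimensions are all assumptions of this example, not details taken from the paper.

```python
# Minimal sketch of the probing comparison described above. Assumptions
# (this example's, not the paper's): embeddings are mean-pooled per
# utterance, per-dimension importance comes from one-vs-rest logistic
# regression coefficients, the probe is ridge regression, k is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score


def top_k_dims(X, y_emotion, k=100):
    """Rank embedding dimensions by |coefficient| of a binary
    (one-vs-rest) classifier for a single emotion."""
    clf = LogisticRegression(max_iter=1000).fit(X, y_emotion)
    return np.argsort(np.abs(clf.coef_[0]))[::-1][:k]


def probe_r2(X, y_feature):
    """Cross-validated R^2 of a ridge probe predicting one
    interpretable acoustic feature (e.g., an eGeMAPS descriptor)."""
    return cross_val_score(Ridge(alpha=1.0), X, y_feature,
                           scoring="r2", cv=5).mean()


# X: (n_utterances, n_dims) pooled WavLM embeddings
# y_anger: binary labels for one emotion; f0: (n_utterances,) mean f0 per clip
# dims = top_k_dims(X, y_anger)
# r2_full, r2_sub = probe_r2(X, f0), probe_r2(X[:, dims], f0)
# If the subset also classifies the emotion better than all dimensions,
# a higher r2_sub would suggest f0 matters for recognizing that emotion.
```

Linear probes and coefficient-based importance are one reasonable instantiation; the paper's actual classifiers, probes, and dimension-selection criterion may differ.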