Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh
{"title":"Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features","authors":"Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh","doi":"arxiv-2409.09511","DOIUrl":null,"url":null,"abstract":"Pre-trained deep learning embeddings have consistently shown superior\nperformance over handcrafted acoustic features in speech emotion recognition\n(SER). However, unlike acoustic features with clear physical meaning, these\nembeddings lack clear interpretability. Explaining these embeddings is crucial\nfor building trust in healthcare and security applications and advancing the\nscientific understanding of the acoustic information that is encoded in them.\nThis paper proposes a modified probing approach to explain deep learning\nembeddings in the SER space. We predict interpretable acoustic features (e.g.,\nf0, loudness) from (i) the complete set of embeddings and (ii) a subset of the\nembedding dimensions identified as most important for predicting each emotion.\nIf the subset of the most important dimensions better predicts a given emotion\nthan all dimensions and also predicts specific acoustic features more\naccurately, we infer those acoustic features are important for the embedding\nmodel for the given task. We conducted experiments using the WavLM embeddings\nand eGeMAPS acoustic features as audio representations, applying our method to\nthe RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we\ndemonstrate that Energy, Frequency, Spectral, and Temporal categories of\nacoustic features provide diminishing information to SER in that order,\ndemonstrating the utility of the probing classifier method to relate embeddings\nto interpretable acoustic features.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09511","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Pre-trained deep learning embeddings have consistently outperformed handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features, which have clear physical meaning, these embeddings lack interpretability. Explaining them is crucial for building trust in healthcare and security applications and for advancing scientific understanding of the acoustic information they encode. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embedding dimensions and (ii) the subset of dimensions identified as most important for predicting each emotion. If the subset of most important dimensions predicts a given emotion better than all dimensions do, and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for that task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. This evaluation shows that the Energy, Frequency, Spectral, and Temporal categories of acoustic features contribute progressively less information to SER, in that order, and demonstrates the utility of the probing classifier method for relating embeddings to interpretable acoustic features.
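The abstract describes the probing comparison only at a high level. The sketch below illustrates one plausible way to implement it; it is a minimal sketch, not the authors' code. The choices of mean-pooled WavLM embeddings as input, one-vs-rest logistic regression coefficients as the dimension-importance measure, a ridge-regression probe, and k = 100 selected dimensions are all assumptions of this example, not details taken from the paper.

```python
# Minimal sketch of the probing comparison described above. Assumptions
# (this example's, not the paper's): embeddings are mean-pooled per
# utterance, per-dimension importance comes from one-vs-rest logistic
# regression coefficients, the probe is ridge regression, k is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score


def top_k_dims(X, y_emotion, k=100):
    """Rank embedding dimensions by |coefficient| of a binary
    (one-vs-rest) classifier for a single emotion."""
    clf = LogisticRegression(max_iter=1000).fit(X, y_emotion)
    return np.argsort(np.abs(clf.coef_[0]))[::-1][:k]


def probe_r2(X, y_feature):
    """Cross-validated R^2 of a ridge probe predicting one
    interpretable acoustic feature (e.g., an eGeMAPS descriptor)."""
    return cross_val_score(Ridge(alpha=1.0), X, y_feature,
                           scoring="r2", cv=5).mean()


# X: (n_utterances, n_dims) pooled WavLM embeddings
# y_anger: binary labels for one emotion; f0: (n_utterances,) mean f0 per clip
# dims = top_k_dims(X, y_anger)
# r2_full, r2_sub = probe_r2(X, f0), probe_r2(X[:, dims], f0)
# If the subset also classifies the emotion better than all dimensions,
# a higher r2_sub would suggest f0 matters for recognizing that emotion.
```

Linear probes and coefficient-based importance are one reasonable instantiation; the paper's actual classifiers, probes, and dimension-selection criterion may differ.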