What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark

arXiv - CS - Sound | Pub Date: 2024-06-14 | DOI: arxiv-2406.09933

Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed


Abstract

Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements on specific emotional datasets, there is still a research gap in the ability of SER systems to generalize to real-world conditions. In this paper, we investigate approaches to generalizing SER systems across different emotion datasets. In particular, we combine 11 emotional speech datasets and present a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution by applying over-sampling methods when combining SER datasets for training. Furthermore, we explore several evaluation protocols to assess how well SER models generalize. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
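The abstract does not say which over-sampling method is applied when the datasets are pooled. As a minimal sketch, assuming plain random over-sampling with replacement (one common choice, not necessarily the authors'), the idea can be illustrated as follows. The function name `random_oversample`, the utterance IDs, and the emotion labels are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, seed=0):
    """Balance (utterance_id, label) pairs by duplicating minority-class items.

    Assumption: simple random over-sampling with replacement; the abstract
    does not state which over-sampling method the paper uses.
    """
    if not samples:
        return []
    rng = random.Random(seed)

    # Group the pooled samples by emotion label.
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    balanced = []
    for items in by_label.values():
        balanced.extend(items)  # keep every original sample
        # Draw extra copies with replacement to reach the target size.
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical pooled training set drawn from several SER datasets.
pooled = [
    ("u1", "happy"), ("u2", "happy"), ("u3", "happy"),
    ("u4", "sad"),
    ("u5", "angry"), ("u6", "angry"),
]
balanced = random_oversample(pooled, seed=1)
label_counts = Counter(label for _, label in balanced)
```

Over-sampling of this kind would normally be applied to the training split only, so that duplicated utterances do not leak into evaluation data.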


