Title: Study of Dense Network Approaches for Speech Emotion Recognition
Authors: Mohammed Abdel-Wahab, C. Busso
Venue: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5084–5088
DOI: 10.1109/ICASSP.2018.8461866
Publication date: 2018-04-30
Citations: 35
Abstract
Deep neural networks have proven very effective in various classification problems and show great promise for emotion recognition from speech. Studies have proposed various architectures that further improve the performance of emotion recognition systems. However, several open questions remain about the best way to build a speech emotion recognition system. Would the system's performance improve if we had more labeled data? How much do we benefit from data augmentation? Which activation and regularization schemes are more beneficial? How does the depth of the network affect performance? We are collecting the MSP-Podcast corpus, a large dataset with over 30 hours of data, which provides an ideal resource to address these questions. This study explores various dense architectures to predict arousal, valence, and dominance scores. We investigate varying the training set size, the width and depth of the network, and the activation functions used during training. We also study the effect of data augmentation on the network's performance. We find that a bigger training set improves performance, and that batch normalization is crucial to achieving good performance with deeper networks. We do not observe significant differences in performance between residual networks and dense networks.
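The abstract does not specify the exact architecture, so as a rough illustration only, the forward pass of one batch-normalized dense (fully connected) layer feeding a three-output regression head (arousal, valence, dominance) can be sketched as below. The layer sizes, batch size, and weight initialization are placeholders chosen for the sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Affine transform: one fully connected layer.
    return x @ w + b

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch (inference-time scale/shift omitted).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

# Placeholder sizes: 64-dim utterance-level acoustic features, one hidden
# layer of 256 units, 3 outputs (arousal, valence, dominance).
d_in, d_h, d_out = 64, 256, 3
x = rng.standard_normal((32, d_in))  # a batch of 32 feature vectors

w1 = rng.standard_normal((d_in, d_h)) * 0.01
b1 = np.zeros(d_h)
w2 = rng.standard_normal((d_h, d_out)) * 0.01
b2 = np.zeros(d_out)

h = relu(batch_norm(dense(x, w1, b1)))  # batch norm before the nonlinearity
scores = dense(h, w2, b2)               # predicted arousal, valence, dominance
print(scores.shape)  # (32, 3)
```

Placing batch normalization before each activation, as here, is the usual arrangement the paper's finding about deeper networks refers to; a deeper variant would simply stack more `batch_norm` + `relu` dense layers before the output head.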