{"title":"Gender Differentiated Convolutional Neural Networks for Speech Emotion Recognition","authors":"P. Mishra, Ruchir Sharma","doi":"10.1109/ICUMT51630.2020.9222412","DOIUrl":null,"url":null,"abstract":"This paper proposes a two-stage gender-differentiated system for Speech Emotion Recognition using Mel-frequency Cepstral Coefficients and Convolutional Neural Networks. Acoustical variances between male and female speakers pose a problem and it is established that gender-dependent emotion recognizers perform better than gender-independent ones. The provided solution can recognize seven emotions (anger, disgust, fear, happiness, sadness, surprise, and neutral state). Data augmentation is used to compensate for the lack of quality data, with the raw speech samples derived from four datasets, namely: RAVDESS, CREMA-D, SAVEE, and TESS. The system is composed of two stages: 1) gender classification and; 2) emotion classification. The output of the gender classifier in the first stage determines the gender-specific classifier for the second stage. The experimental evaluation displays the performance in terms of the correct emotion recognition rate of the proposed SER model. The results demonstrate that a gender-differentiated system significantly improves performance. The obtained results also show that using Global Average Pooling instead of a fully-connected network at the end of the CNN classifier further improves the performance. Future implementations of this proposed system may allow effective human-computer intelligent interaction.","PeriodicalId":170847,"journal":{"name":"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICUMT51630.2020.9222412","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
This paper proposes a two-stage gender-differentiated system for Speech Emotion Recognition (SER) using Mel-frequency Cepstral Coefficients (MFCCs) and Convolutional Neural Networks (CNNs). Acoustic differences between male and female speakers pose a challenge, and it is established that gender-dependent emotion recognizers outperform gender-independent ones. The proposed solution recognizes seven emotions (anger, disgust, fear, happiness, sadness, surprise, and the neutral state). Data augmentation is used to compensate for the lack of quality data; the raw speech samples are drawn from four datasets: RAVDESS, CREMA-D, SAVEE, and TESS. The system is composed of two stages: 1) gender classification and 2) emotion classification. The output of the gender classifier in the first stage selects the gender-specific emotion classifier used in the second stage. The experimental evaluation reports the performance of the proposed SER model in terms of the correct emotion recognition rate. The results demonstrate that a gender-differentiated system significantly improves performance, and that using Global Average Pooling instead of a fully-connected head at the end of the CNN classifier improves it further. Future implementations of this system may enable effective human-computer intelligent interaction.
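The abstract names MFCC features and data augmentation but not the concrete preprocessing steps. Below is a minimal sketch of how such a front end could look, assuming librosa for audio processing; the augmentations (additive noise, pitch shifting), sample rate, n_mfcc, and frame count are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of MFCC extraction with simple data augmentation.
# ASSUMPTIONS: the paper does not specify which augmentations it used;
# additive noise and pitch shifting are common choices shown here for
# illustration, as are sr=16000, n_mfcc=40, and max_frames=174.
import numpy as np
import librosa

def augment(y, sr):
    """Yield the original waveform plus two simple augmented variants."""
    yield y
    yield y + 0.005 * np.random.randn(len(y))               # additive white noise
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # shift up 2 semitones

def extract_mfcc(path, sr=16000, n_mfcc=40, max_frames=174):
    """Load one speech sample and return fixed-size MFCC matrices per variant."""
    y, sr = librosa.load(path, sr=sr)
    features = []
    for variant in augment(y, sr):
        mfcc = librosa.feature.mfcc(y=variant, sr=sr, n_mfcc=n_mfcc)
        # Pad or truncate along the time axis so every sample has the same shape.
        if mfcc.shape[1] < max_frames:
            mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
        features.append(mfcc[:, :max_frames])
    return np.stack(features)
```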
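The abstract also describes the two-stage routing: a gender classifier picks the gender-specific emotion CNN, and Global Average Pooling replaces the fully-connected head. The sketch below shows that structure in tf.keras; the layer widths, kernel sizes, and input shape are hypothetical, since the abstract gives no hyperparameters, and only the overall routing and the GAP-instead-of-FC choice follow the paper.

```python
# A minimal sketch of the two-stage, gender-differentiated pipeline.
# ASSUMPTIONS: layer widths, kernel sizes, and the (40, 174, 1) input
# shape are placeholders; the models below are untrained and illustrate
# routing only.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def build_cnn(num_classes, input_shape=(40, 174, 1)):
    """A small 2-D CNN over the MFCC matrix, ending in Global Average Pooling."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),   # replaces a fully-connected head
        layers.Dense(num_classes, activation="softmax"),
    ])

gender_model = build_cnn(num_classes=2)   # stage 1: male vs. female
emotion_models = {
    0: build_cnn(num_classes=7),          # stage 2: emotion CNN for male speech
    1: build_cnn(num_classes=7),          # stage 2: emotion CNN for female speech
}

def predict_emotion(mfcc):
    """Route a (40, 174) MFCC matrix through stage 1, then the matching stage-2 CNN."""
    x = mfcc[np.newaxis, ..., np.newaxis]  # add batch and channel dimensions
    gender = int(gender_model.predict(x, verbose=0).argmax(axis=-1)[0])
    probs = emotion_models[gender].predict(x, verbose=0)
    return EMOTIONS[int(probs.argmax(axis=-1)[0])]
```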