Speech Emotion Recognition Using ANN on MFCC Features

Harshit Dolka, Arul Xavier V M, S. Juliet

2021 3rd International Conference on Signal Processing and Communication (ICPSC), 13 May 2021. DOI: 10.1109/ICSPC51351.2021.9451810
Speech Emotion Recognition (SER) is an active research topic in Human-Computer Interaction. This paper trains an artificial neural network (ANN) model for SER on Mel-Frequency Cepstral Coefficient (MFCC) features and evaluates it on several audio datasets to compare performance. The model classifies audio files into up to eight emotional states: happy, sad, angry, fearful, surprised, disgusted, calm, and neutral, although the number of emotions available varies across the selected datasets. The proposed model achieves an average accuracy of 99.52% on the TESS dataset, 88.72% on the RAVDESS dataset, 71.69% on the CREMA dataset, and 86.80% on the SAVEE dataset.
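The MFCC front end the abstract describes can be sketched as follows. This is a plain-NumPy illustration of the standard MFCC computation (framing, windowing, power spectrum, mel filterbank, log, DCT), not the authors' code; the paper likely used a library such as librosa, and every parameter default here (sample rate, frame size, hop, filter count) is an illustrative assumption.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Textbook MFCC computation; all defaults are illustrative assumptions."""
    # 1. Frame the signal and apply a Hamming window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)

    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Mel filterbank: triangular filters spaced evenly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # 4. Log filterbank energies, then DCT-II to decorrelate; keep n_mfcc coeffs.
    energies = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return energies @ dct.T  # shape: (n_frames, n_mfcc)

# Toy usage: a one-second 440 Hz tone stands in for a speech clip.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(feats.shape)  # one 13-coefficient MFCC vector per frame
```

In the pipeline the abstract outlines, a per-clip summary of these frame-level vectors (commonly the mean over frames) would then be fed to the ANN classifier, which predicts one of the emotion labels.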