{"title":"基于卷积神经网络的语音数据情感识别","authors":"M. H. Pham, F. Noori, J. Tørresen","doi":"10.1109/scc53769.2021.9768372","DOIUrl":null,"url":null,"abstract":"Identifying emotion from speech has a wide range of applications and has drawn special interests in research to improve the human-computer interaction experience. Traditional machine learning approaches usually face the challenge of selecting the optimal feature set for each application. Deep learning, on the other hand, allows end-to-end development of the models and inherent feature extraction. In this study, we evaluate the performance of Convolutional Neural Network on different kinds of spectral features of acoustic signal collections, from two popular open databases Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Berlin Database of Emotional Speech (EmoDB). Two-to-eight classes of emotions (RAVDESS) and two-to-seven classes of emotions (EmoDB) are identified by the deep learning model. The results, in terms of unweighted average recall, are 0.888 (two classes) and 0.694 (eight classes) for the RAVDESS dataset. The corresponding results for the EmoDB dataset are 0.993 (two classes) and 0.764 (seven classes)","PeriodicalId":365845,"journal":{"name":"2021 IEEE 2nd International Conference on Signal, Control and Communication (SCC)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Emotion Recognition using Speech Data with Convolutional Neural Network\",\"authors\":\"M. H. Pham, F. Noori, J. Tørresen\",\"doi\":\"10.1109/scc53769.2021.9768372\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Identifying emotion from speech has a wide range of applications and has drawn special interests in research to improve the human-computer interaction experience. Traditional machine learning approaches usually face the challenge of selecting the optimal feature set for each application. Deep learning, on the other hand, allows end-to-end development of the models and inherent feature extraction. In this study, we evaluate the performance of Convolutional Neural Network on different kinds of spectral features of acoustic signal collections, from two popular open databases Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and Berlin Database of Emotional Speech (EmoDB). Two-to-eight classes of emotions (RAVDESS) and two-to-seven classes of emotions (EmoDB) are identified by the deep learning model. The results, in terms of unweighted average recall, are 0.888 (two classes) and 0.694 (eight classes) for the RAVDESS dataset. 
The corresponding results for the EmoDB dataset are 0.993 (two classes) and 0.764 (seven classes)\",\"PeriodicalId\":365845,\"journal\":{\"name\":\"2021 IEEE 2nd International Conference on Signal, Control and Communication (SCC)\",\"volume\":\"90 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-12-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 2nd International Conference on Signal, Control and Communication (SCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/scc53769.2021.9768372\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 2nd International Conference on Signal, Control and Communication (SCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/scc53769.2021.9768372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Emotion Recognition using Speech Data with Convolutional Neural Network
Identifying emotion from speech has a wide range of applications and has drawn special interest in research aimed at improving the human-computer interaction experience. Traditional machine learning approaches usually face the challenge of selecting the optimal feature set for each application. Deep learning, on the other hand, allows end-to-end development of models with inherent feature extraction. In this study, we evaluate the performance of a Convolutional Neural Network on different kinds of spectral features of acoustic signals from two popular open databases: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Berlin Database of Emotional Speech (EmoDB). The deep learning model identifies two to eight classes of emotions for RAVDESS and two to seven classes for EmoDB. The results, in terms of unweighted average recall, are 0.888 (two classes) and 0.694 (eight classes) for the RAVDESS dataset. The corresponding results for the EmoDB dataset are 0.993 (two classes) and 0.764 (seven classes).
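The abstract does not specify the network configuration or feature settings, so the following is a minimal, hypothetical sketch in Python (librosa + PyTorch) of the general pipeline it describes: converting a speech clip to a spectral feature (here a log-mel spectrogram, one of the feature types mentioned) and classifying it with a small 2-D CNN. All layer sizes, the 3-second crop, the sample rate, and the class count are illustrative assumptions, not the authors' published setup.

```python
# Illustrative sketch only: the paper's exact architecture and hyperparameters
# are not given in the abstract; every size and setting below is an assumption.
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_spectrogram_features(wav_path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Load a clip and convert it to a log-mel spectrogram (one spectral feature type)."""
    y, sr = librosa.load(wav_path, sr=sr, duration=3.0)  # fixed-length crop (assumption)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.from_numpy(log_mel).float().unsqueeze(0)  # shape: (1, n_mels, frames)

class SpeechEmotionCNN(nn.Module):
    """Small 2-D CNN over the spectrogram; depth and width are placeholder choices."""
    def __init__(self, n_classes: int = 8):  # e.g. the eight RAVDESS emotion classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixes the head input size for any clip length
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Usage: add a batch dimension, then take the argmax over the emotion logits.
# logits = SpeechEmotionCNN()(mel_spectrogram_features("clip.wav").unsqueeze(0))
# predicted_class = logits.argmax(dim=1)
```

The adaptive pooling layer is one common way to let a CNN accept spectrograms of varying duration; whether the authors used it, fixed-length inputs, or another scheme is not stated in the abstract.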