Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network

Neha Dewangan, K. Thakur, S. Mandal, Bikesh Kumar Singh
{"title":"利用人工神经网络进行基于时频图像的语音情感识别","authors":"Neha Dewangan, K. Thakur, S. Mandal, BikeshKumar Singh","doi":"10.52228/jrub.2023-36-2-10","DOIUrl":null,"url":null,"abstract":"Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application in artificial intelligence. Speech recognition intelligence is employed in various applications such as digital assistance, security, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008, Livingstone et al., 2005, Burkhardt et al., 2005). From these datasets, six emotions namely anger, disgust, fear, happy, neutral, and sad, are selected for automatic speech emotion recognition. Various types of algorithms are already reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for the six emotions mentioned above. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. The t-f image is a visual representation of the time component and frequency component at that time in the two-dimensional space, and differing colors show its amplitude. An artificial neural network-based multiclass machine learning approach is used to classify selected emotions. The experimental results show that the above-mentioned emotions' average classification accuracy (CA) of 88.6%, 85.5%, and 93.56% is achieved using SAVEE, RAVDESS, and EmoDB datasets, respectively. Also, an average CA of 83.44% has been achieved for the combination of all three datasets. The maximum reported average classification accuracy (CA) using spectrogram for SAVEE, RAVDESS, and EmoDB dataset is 87.8%, 79.5 %, and 83.4%, respectively (Wani et al., 2020, Mustaqeem and Kwon, 2019, Badshah et al., 2017). The proposed t-f image-based classification model shows improvement in average CA by 0.91%, 7.54%, and 12.18 % for SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interface applications to detect emotions precisely from acoustic signals.","PeriodicalId":17214,"journal":{"name":"Journal of Ravishankar University (PART-B)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network\",\"authors\":\"Neha Dewangan, K. Thakur, S. Mandal, BikeshKumar Singh\",\"doi\":\"10.52228/jrub.2023-36-2-10\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application in artificial intelligence. Speech recognition intelligence is employed in various applications such as digital assistance, security, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008, Livingstone et al., 2005, Burkhardt et al., 2005). From these datasets, six emotions namely anger, disgust, fear, happy, neutral, and sad, are selected for automatic speech emotion recognition. Various types of algorithms are already reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for the six emotions mentioned above. 
The proposed model extracts 472 grayscale image features from the t-f images of speech signals. The t-f image is a visual representation of the time component and frequency component at that time in the two-dimensional space, and differing colors show its amplitude. An artificial neural network-based multiclass machine learning approach is used to classify selected emotions. The experimental results show that the above-mentioned emotions' average classification accuracy (CA) of 88.6%, 85.5%, and 93.56% is achieved using SAVEE, RAVDESS, and EmoDB datasets, respectively. Also, an average CA of 83.44% has been achieved for the combination of all three datasets. The maximum reported average classification accuracy (CA) using spectrogram for SAVEE, RAVDESS, and EmoDB dataset is 87.8%, 79.5 %, and 83.4%, respectively (Wani et al., 2020, Mustaqeem and Kwon, 2019, Badshah et al., 2017). The proposed t-f image-based classification model shows improvement in average CA by 0.91%, 7.54%, and 12.18 % for SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interface applications to detect emotions precisely from acoustic signals.\",\"PeriodicalId\":17214,\"journal\":{\"name\":\"Journal of Ravishankar University (PART-B)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Ravishankar University (PART-B)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.52228/jrub.2023-36-2-10\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Ravishankar University (PART-B)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.52228/jrub.2023-36-2-10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network
Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application of artificial intelligence. Speech-based intelligence is employed in applications such as digital assistants, security systems, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008; Livingstone et al., 2018; Burkhardt et al., 2005). From these datasets, six emotions, namely anger, disgust, fear, happiness, neutral, and sadness, are selected for automatic speech emotion recognition. Various algorithms have already been reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for the six emotions mentioned above. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. A t-f image is a two-dimensional visual representation of a signal's frequency content as it evolves over time, with color (or grayscale intensity) indicating amplitude. An artificial neural network-based multiclass machine learning approach is used to classify the selected emotions. The experimental results show that average classification accuracies (CA) of 88.6%, 85.5%, and 93.56% are achieved for the above-mentioned emotions on the SAVEE, RAVDESS, and EmoDB datasets, respectively. An average CA of 83.44% is achieved when the three datasets are combined. The highest previously reported average CA using spectrograms for the SAVEE, RAVDESS, and EmoDB datasets is 87.8%, 79.5%, and 83.4%, respectively (Wani et al., 2020; Mustaqeem and Kwon, 2019; Badshah et al., 2017). The proposed t-f image-based classification model thus improves average CA by 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interaction applications that require precise detection of emotions from acoustic signals.
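To make the pipeline described above concrete, the following is a minimal illustrative sketch, not the authors' implementation: it computes a log-magnitude spectrogram as the t-f image, extracts a small set of grayscale statistics and GLCM texture descriptors as a stand-in for the paper's 472 image features (which are not enumerated here), and trains a multiclass ANN (multilayer perceptron). It assumes the librosa, scikit-image, and scikit-learn libraries; the file paths and labels are hypothetical and would come from SAVEE, RAVDESS, or EmoDB.

```python
# Illustrative sketch only (not the authors' code): t-f image -> grayscale
# features -> ANN classifier. Assumes librosa, scikit-image, and scikit-learn.
import numpy as np
import librosa
from skimage.feature import graycomatrix, graycoprops
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def tf_image(wav_path, sr=16000, n_fft=512, hop_length=128):
    """Compute a log-magnitude spectrogram and scale it to an 8-bit grayscale image."""
    y, _ = librosa.load(wav_path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    log_mag = librosa.amplitude_to_db(mag, ref=np.max)
    span = log_mag.max() - log_mag.min() + 1e-9
    return (255 * (log_mag - log_mag.min()) / span).astype(np.uint8)

def image_features(img):
    """Toy stand-in for the paper's 472 grayscale features: first-order
    statistics plus GLCM texture descriptors."""
    feats = [img.mean(), img.std(), np.median(img)]
    glcm = graycomatrix(img, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    for prop in ("contrast", "homogeneity", "energy", "correlation"):
        feats.extend(graycoprops(glcm, prop).ravel())
    return np.asarray(feats)

# Hypothetical usage: `paths` and `labels` would be read from SAVEE/RAVDESS/EmoDB.
# X = np.stack([image_features(tf_image(p)) for p in paths])
# X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000).fit(X_tr, y_tr)
# print("test accuracy:", clf.score(X_te, y_te))
```

In practice, the window length, hop size, grayscale quantization, and the actual 472-feature set would have to follow the configuration reported in the paper rather than the placeholder choices shown here.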