Time-Frequency Image-based Speech Emotion Recognition using Artificial Neural Network

Journal of Ravishankar University (PART-B), Pub Date: 2023-12-31, DOI: 10.52228/jrub.2023-36-2-10

Neha Dewangan, K. Thakur, S. Mandal, Bikesh Kumar Singh



Abstract

Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application in artificial intelligence. Speech recognition intelligence is employed in applications such as digital assistance, security, and other human-machine interactive products. The present work uses three open-source acoustic datasets: SAVEE, RAVDESS, and EmoDB (Haq et al., 2008; Livingstone et al., 2005; Burkhardt et al., 2005). From these datasets, six emotions (anger, disgust, fear, happy, neutral, and sad) are selected for automatic speech emotion recognition. Various algorithms have already been reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for these six emotions. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. A t-f image is a two-dimensional visual representation of a signal in which one axis is time, the other is the frequency content at that time, and color encodes amplitude. An artificial neural network-based multiclass machine learning approach is used to classify the selected emotions. The experimental results show average classification accuracies (CA) of 88.6%, 85.5%, and 93.56% on the SAVEE, RAVDESS, and EmoDB datasets, respectively, and an average CA of 83.44% on the combination of all three datasets. The maximum previously reported average CA using spectrograms for the SAVEE, RAVDESS, and EmoDB datasets is 87.8%, 79.5%, and 83.4%, respectively (Wani et al., 2020; Mustaqeem and Kwon, 2019; Badshah et al., 2017). The proposed t-f image-based classification model improves average CA by 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively; these figures are relative to the previously reported values (for example, (88.6 - 87.8) / 87.8 ≈ 0.91%). This study can be helpful in human-computer interface applications that need to detect emotions precisely from acoustic signals.
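
The improvement figures quoted in the abstract appear to be relative gains over the previously reported accuracies rather than percentage-point differences. The following is a minimal Python sketch that checks this arithmetic using only the accuracies stated in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
# Accuracies (in percent) quoted in the abstract:
# (proposed model, best previously reported spectrogram-based result).
accuracies = {
    "SAVEE": (88.6, 87.8),
    "RAVDESS": (85.5, 79.5),
    "EmoDB": (93.56, 83.4),
}


def relative_improvement(proposed, baseline):
    """Gain of `proposed` over `baseline`, as a percentage of the baseline."""
    return (proposed - baseline) / baseline * 100.0


def absolute_improvement(proposed, baseline):
    """Gain of `proposed` over `baseline`, in percentage points."""
    return proposed - baseline


# Compute both kinds of improvement for every dataset.
relative = {name: relative_improvement(p, b) for name, (p, b) in accuracies.items()}
absolute = {name: absolute_improvement(p, b) for name, (p, b) in accuracies.items()}

for name in accuracies:
    print(f"{name}: +{absolute[name]:.2f} points, +{relative[name]:.2f}% relative")
```

Under this reading, the relative gains come out at roughly 0.91%, 7.55%, and 12.18%. The second value matches the abstract's 7.54% if that figure was truncated rather than rounded. The corresponding percentage-point differences are 0.8, 6.0, and 10.16.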


