Speech emotion recognition using overlapping sliding window and Shapley additive explainable deep neural network

Impact factor: 2.7; JCR quartile: Q2 (Computer Science, Information Systems)

Journal of Information and Telecommunication. Pub Date: 2023-03-17. DOI: 10.1080/24751839.2023.2187278

Nhat Truong Pham, Sy Dzung Nguyen, Vu Song Thuy Nguyen, Bich Ngoc Hong Pham, Duc Ngoc Minh Dang

Volume 7, issue 1, pages 317-335. Publication type: journal article.

Citations: 2

Abstract

Speech emotion recognition (SER) has several applications, such as e-learning, human-computer interaction, customer service, and healthcare systems. Although researchers have investigated many techniques to improve the accuracy of SER, feature extraction, classifier schemes, and computational cost remain challenging. To address these problems, this study proposes a new set of 1D features for SER, extracted with an overlapping sliding window (OSW) technique. In addition, a deep neural network-based classifier scheme called the deep Pattern Recognition Network (PRN) is designed to categorize emotional states from the new set of 1D features. The proposed method is evaluated on the Emo-DB and AESSD datasets, which contain several different emotional states. The experimental results show that it achieves an accuracy of 98.5% on Emo-DB and 87.1% on AESSD. This accuracy is comparable to or better than that of state-of-the-art and current approaches that use 1D features on the same datasets for SER. Furthermore, SHAP (SHapley Additive exPlanations) analysis is employed to interpret the prediction model and to assist system developers in selecting the optimal features to integrate into the desired system.
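The abstract does not spell out how the overlapping sliding window is applied, so the following is only a minimal sketch of the general idea, not the authors' implementation. The function names, the window and hop sizes, and the choice of per-window mean and standard deviation as features are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def overlapping_windows(signal, window_size, hop_size):
    """Yield successive windows of `window_size` samples, advancing by `hop_size`.

    Hypothetical helper: a hop smaller than the window makes neighbours overlap.
    """
    if window_size <= 0 or hop_size <= 0:
        raise ValueError("window_size and hop_size must be positive")
    if hop_size >= window_size:
        raise ValueError("hop_size must be smaller than window_size to overlap")
    # Stop at the last start index that still leaves a full window.
    for start in range(0, len(signal) - window_size + 1, hop_size):
        yield signal[start:start + window_size]


def window_features(signal, window_size, hop_size):
    """Flatten per-window statistics into one 1D feature list.

    Mean and population standard deviation are placeholder features (an
    assumption); the paper's actual 1D feature set is not given in the abstract.
    """
    features = []
    for window in overlapping_windows(signal, window_size, hop_size):
        features.append(mean(window))
        features.append(pstdev(window))
    return features


# Toy signal standing in for audio samples.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
windows = list(overlapping_windows(signal, window_size=4, hop_size=2))
features = window_features(signal, window_size=4, hop_size=2)
print(windows)
print(features)
```

With a window of 4 samples and a hop of 2, each window shares half of its samples with its neighbour, so the toy signal of 8 samples yields 3 windows and a 1D feature vector of length 6. A vector of this kind is what a classifier such as the PRN would take as input.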
