Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition

Q4 Computer Science

International Journal of Information and Communication Technology Pub Date : 2023-01-18 DOI:10.32890/jict2023.22.1.3

Y. Bhanusree, Samayamantula Srinivas Kumar, Anne Koteswara Rao


Citations: 0

Abstract

Speech emotion recognition (SER) is the task of identifying human emotions from speech utterances. Speech utterances combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes the language barrier. Machine learning and deep learning techniques have previously been proposed for classifying emotions using handpicked features. To achieve effective and generalized SER, feature extraction can be performed with deep neural networks and classification with ensemble learning. The proposed model employs a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The proposed model was implemented on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results showed that the proposed model is efficient at extracting spatiotemporal features from time-series speech signals and classifies emotions with good accuracy. Class confusion among the emotions was reduced for both data corpora, indicating that the model generalizes.
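The two-stage idea in the abstract (per-frame features pooled by attention, then an ensemble classifier) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' TDACNN-RF: the convolutional layers are omitted, the attention scoring vector `W` is a fixed hypothetical value rather than a learned one, the "forest" uses one-split trees on bootstrap samples instead of full decision trees, and the data are synthetic two-dimensional frames rather than RAVDESS or IEMOCAP features. All function names are illustrative.

```python
import math
import random
from collections import Counter

def attention_pool(frames, w):
    """Stage 1 stand-in: collapse T per-frame vectors into one utterance vector."""
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    m = max(scores)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax weights over time steps
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames))
              for d in range(len(frames[0]))]
    return pooled, alphas

def fit_stump(X, y, rng):
    """Fit a one-split tree on one randomly chosen feature."""
    d = rng.randrange(len(X[0]))
    vals = sorted({x[d] for x in X})
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or [vals[0]]
    best = None
    for thr in thresholds:
        left = [lab for x, lab in zip(X, y) if x[d] <= thr]
        right = [lab for x, lab in zip(X, y) if x[d] > thr]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
        errors = (sum(lab != l_lab for lab in left)
                  + sum(lab != r_lab for lab in right))
        if best is None or errors < best[0]:
            best = (errors, d, thr, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)

def fit_forest(X, y, n_trees=25, seed=0):
    """Stage 2 stand-in: bagged one-split trees (a simplified random forest)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict(forest, x):
    """Majority vote over all trees in the ensemble."""
    votes = [l_lab if x[d] <= thr else r_lab for d, thr, l_lab, r_lab in forest]
    return Counter(votes).most_common(1)[0][0]

data_rng = random.Random(1)

def toy_utterance(level, n_frames=5):
    """Hypothetical utterance: n_frames two-dimensional frame features."""
    return [[level + data_rng.uniform(-0.1, 0.1),
             2 * level + data_rng.uniform(-0.1, 0.1)] for _ in range(n_frames)]

W = [1.0, 0.5]  # hypothetical attention scoring vector (learned in a real model)
labels = ["calm", "angry"] * 10
X = [attention_pool(toy_utterance(0.0 if lab == "calm" else 1.0), W)[0]
     for lab in labels]
forest = fit_forest(X, labels)

test_vec, test_alphas = attention_pool(toy_utterance(1.0), W)
print(predict(forest, test_vec))
```

In a realistic setup the pooled vector would come from a trained network, and the second stage would more likely be a library random forest than hand-written stumps; the sketch only shows how the feature extractor and the ensemble classifier are decoupled.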


Source journal

International Journal of Information and Communication Technology Computer Science-Information Systems

CiteScore

0.70

Self-citation rate

0.00%


Journal description: IJICT is a refereed journal in the field of information and communication technology (ICT), providing an international forum for professionals, engineers and researchers. IJICT reports the new paradigms in this emerging field of technology and envisions the future developments in the frontier areas. The journal addresses issues for the vertical and horizontal applications in this area. Topics covered include: information theory/coding; information/IT/network security, standards, applications; Internet/web based systems/products; data mining/warehousing; network planning, design, administration; sensor/ad hoc networks; human-computer intelligent interaction, AI; computational linguistics, digital speech; distributed/cooperative media; interactive communication media/content; social interaction, mobile communications; signal representation/processing, image processing; virtual reality, cyber law, e-governance; microprocessor interfacing, hardware design; control of industrial processes, ERP/CRM/SCM.