A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition

IF 6.4 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Neural Systems Pub Date : 2022-05-12 DOI:10.1142/S0129065722500241

J. Lope, M. Graña


Citations: 2

Abstract

In recent years, speech emotion recognition (SER) has emerged as one of the most active research areas in human-machine interaction. Innovative electronic devices, services and applications increasingly aim to check the user's emotional state, either to issue alerts under predefined conditions or to adapt the system's responses to the user's emotions. Voice expression is a very rich and noninvasive source of information for emotion assessment. This paper presents a novel SER approach based on a hybrid of a time-distributed convolutional neural network (TD-CNN) and a long short-term memory (LSTM) network. Mel-frequency log-power spectrograms (MFLPSs) extracted from audio recordings are parsed by a sliding window that selects the input for the TD-CNN. The TD-CNN transforms the input image data into a sequence of high-level features that are fed to the LSTM, which carries out the overall signal interpretation. To reduce overfitting, the MFLPS representation allows innovative image data augmentation techniques that have no immediate equivalent on the original audio signal. In validation, the proposed hybrid architecture achieves an average recognition accuracy of 73.98% on the most widely used and hardest publicly distributed database for SER benchmarking. A permutation test confirms that this result is significantly different from random classification ([Formula: see text]). The proposed architecture outperforms state-of-the-art deep learning models as well as conventional machine learning techniques evaluated on the same database and trying to identify the same number of emotions.
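The sliding-window step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_windows`, the window length, the hop size and the toy spectrogram are all assumptions chosen for illustration, since the abstract does not state them. The sketch only shows how a spectrogram (mel bands by time frames) might be cut into an ordered sequence of patches, which is the kind of input a time-distributed CNN followed by an LSTM would consume.

```python
from typing import List

# A spectrogram is represented here as rows = mel bands, columns = time frames.
Spectrogram = List[List[float]]


def sliding_windows(spec: Spectrogram, window_len: int, hop: int) -> List[Spectrogram]:
    """Cut a (n_mels x n_frames) spectrogram into patches along the time axis.

    window_len and hop are hypothetical parameters (in frames); the abstract
    does not give the values used in the paper.
    """
    if window_len <= 0 or hop <= 0:
        raise ValueError("window_len and hop must be positive")
    n_frames = len(spec[0]) if spec else 0
    # Start positions of every window that fits entirely inside the spectrogram.
    starts = range(0, n_frames - window_len + 1, hop)
    # Each patch keeps all mel bands and a contiguous slice of frames.
    return [[row[s:s + window_len] for row in spec] for s in starts]


# Toy stand-in for an MFLPS: 3 mel bands x 10 frames, value = band * 100 + frame.
toy = [[band * 100 + t for t in range(10)] for band in range(3)]
patches = sliding_windows(toy, window_len=4, hop=2)

# Sequence length, mel bands per patch, frames per patch.
print(len(patches), len(patches[0]), len(patches[0][0]))
```

In a full pipeline, each patch would be passed through the same CNN (that is what "time-distributed" means), and the resulting sequence of feature vectors would be read in order by the LSTM.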

Source journal

International Journal of Neural Systems (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

28.80%

Articles published

116

Review time

24 months

About the journal: The International Journal of Neural Systems is a monthly, rigorously peer-reviewed transdisciplinary journal focusing on information processing in both natural and artificial neural systems. Special interests include machine learning, computational neuroscience and neurology. The journal prioritizes innovative, high-impact articles spanning multiple fields, including neurosciences and computer science and engineering. It adopts an open-minded approach to this multidisciplinary field, serving as a platform for novel ideas and enhanced understanding of collective and cooperative phenomena in computationally capable systems.