Speech Emotion Recognition Using Deep Neural Networks, Transfer Learning, and Ensemble Classification Techniques

IF 3.9 | Zone 4, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

Romanian Journal of Information Science and Technology Pub Date : 2023-09-28 DOI:10.59277/romjist.2023.3-4.10

Serban MIHALACHE, Dragos BURILEANU


Citations: 0

Abstract

Speech emotion recognition (SER) is the task of determining the affective content present in speech. It has been a promising research area of great interest in recent years, with important applications in forensic speech and law enforcement operations, among others. In this paper, systems based on deep neural networks (DNNs) spanning five levels of complexity are proposed, developed, and tested, including systems that apply transfer learning (TL) to leading modern image recognition deep learning models, as well as several ensemble classification techniques that lead to significant performance increases. The systems were tested on the most relevant SER datasets, EMODB, CREMAD, and IEMOCAP, in two settings: (i) classification, using the standard full sets of emotion classes as well as additional negative emotion subsets relevant for forensic speech applications; and (ii) regression, using the continuously valued 2D arousal-valence affect space. The proposed systems achieved state-of-the-art results for the full class subset for EMODB (up to 83% accuracy) and performance comparable to other published research for the full class subsets for CREMAD and IEMOCAP (up to 55% and 62% accuracy, respectively). For the class subsets focusing only on negative affective content, the proposed solutions offered top performance compared with previously published state-of-the-art results.
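The abstract does not say which ensemble classification techniques were used. As a generic illustration only, the sketch below shows soft voting, one common ensemble scheme in which the per-class probabilities from several classifiers are averaged and the highest-scoring class is chosen. The label set `EMOTIONS`, the function names, and the probability values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical label set, loosely modelled on a negative-emotion subset.
EMOTIONS = ["anger", "fear", "sadness", "neutral"]


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities produced by several classifiers."""
    if not prob_sets:
        raise ValueError("need at least one classifier output")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all classifiers must score the same classes")
    n_models = len(prob_sets)
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]]) -> str:
    """Return the label with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return EMOTIONS[best]


# Made-up outputs of three classifiers for a single utterance.
outputs = [
    [0.50, 0.30, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.55, 0.10, 0.10],
]
label = predict(outputs)
```

In this made-up example the first classifier favours "anger", but the averaged scores favour "fear", which shows how an ensemble can overrule a single model's decision.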


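Source journal

Romanian Journal of Information Science and Technology (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 5.50

Self-citation rate: 8.60%

Articles published: not listed

Review time: >12 weeks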

About the journal: The primary objective of this journal is the publication of original research results in information science and technology. There is no restriction on the topics addressed; the only acceptance criterion is the originality and quality of the articles, as assessed by independent reviewers. Contributions to recently emerging areas are encouraged. Romanian Journal of Information Science and Technology (a publication of the Romanian Academy) is indexed and abstracted in the following Thomson Reuters products and information services: Science Citation Index Expanded (also known as SciSearch®) and Journal Citation Reports/Science Edition.