Optimizing Speech Emotion Recognition with Machine Learning Based Advanced Audio Cue Analysis
Nuwan Pallewela, D. Alahakoon, A. Adikari, John E. Pierce, ML Rose
Technologies, published 2024-07-11. DOI: 10.3390/technologies12070111
Abstract
In today’s fast-paced and interconnected world, where human–computer interaction is an integral component of daily life, the ability to recognize and understand human emotions has emerged as a crucial facet of technological advancement. However, human emotion, a complex interplay of physiological, psychological, and social factors, poses a formidable challenge even for other humans to comprehend accurately. With the emergence of voice assistants and other speech-based applications, it has become essential to improve the recognition of emotion expressed in speech audio. However, current emotion annotation practice lacks specificity and agreement, as evidenced by conflicting labels assigned to the same speech segments in many human-annotated emotional datasets. Previous studies have had to filter out these conflicts, and a large portion of the collected data has therefore been considered unusable. In this study, we aimed to improve the accuracy of computational prediction of uncertain emotion labels by utilizing high-confidence emotion-labelled speech segments from the IEMOCAP emotion dataset. We implemented an audio-based emotion recognition model using bag-of-audio-words (BoAW) encoding to obtain a representation of the audio aspects of emotion in speech, combined with state-of-the-art recurrent neural network models. Our approach improved on the state of the art in audio-based emotion recognition with a 61.09% accuracy rate, an improvement of 1.02% over the BiDialogueRNN model and 1.72% over the EmoCaps multi-modal emotion recognition model. In comparison to human annotation, our approach achieved similar results in identifying positive and negative emotions. Furthermore, it proved effective in accurately recognizing the sentiment of uncertain emotion segments that were previously considered unusable in other studies. Improvements in audio emotion recognition could have implications for voice-based assistants, healthcare, and other industrial applications that benefit from automated communication.
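The core audio representation named in the abstract is bag-of-audio-words (BoAW) encoding. The sketch below illustrates one common way such an encoding can be built: frame-level acoustic features are clustered into a codebook of "audio words", and each utterance is summarized as a histogram over those words. The specific choices here (MFCC features, a 256-word k-means codebook, 16 kHz audio) are assumptions for illustration, not details taken from the paper, and the downstream recurrent classifier is only indicated in a comment.

```python
# Minimal sketch of bag-of-audio-words (BoAW) encoding for speech emotion
# recognition, assuming librosa and scikit-learn are available. Feature
# choices (13 MFCCs, 16 kHz, 256 codewords) are illustrative, not taken
# from the paper.
import numpy as np
import librosa
from sklearn.cluster import KMeans

N_MFCC = 13          # frame-level acoustic features per frame (assumed)
CODEBOOK_SIZE = 256  # number of "audio words" (assumed)


def frame_features(wav_path, sr=16000):
    """Return an (n_frames, N_MFCC) matrix of frame-level MFCC features."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return mfcc.T  # one row per frame


def learn_codebook(training_wavs):
    """Cluster all training frames into CODEBOOK_SIZE audio words."""
    frames = np.vstack([frame_features(p) for p in training_wavs])
    return KMeans(n_clusters=CODEBOOK_SIZE, random_state=0, n_init=10).fit(frames)


def boaw_encode(wav_path, codebook):
    """Encode one utterance as a normalized histogram over audio words."""
    assignments = codebook.predict(frame_features(wav_path))
    hist = np.bincount(assignments, minlength=CODEBOOK_SIZE).astype(float)
    return hist / max(hist.sum(), 1.0)  # fixed-length vector per utterance


# The resulting per-utterance BoAW vectors would then be fed, in dialogue
# order, to a recurrent classifier (e.g., a GRU/LSTM over the conversation)
# that predicts an emotion label for each speech segment.
```

A fixed-length histogram per utterance is what makes BoAW convenient as an input to sequence models: the recurrent network operates over a sequence of utterance vectors rather than raw, variable-length frame streams.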