Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition

Siba Prasad Mishra, Pankaj Warule, Suman Deb
Speech Communication, Volume 166, Article 103148 (published 2024-11-14). DOI: 10.1016/j.specom.2024.103148

Abstract: The primary goal of automated speech emotion recognition (SER) is to accurately and efficiently identify the specific emotion conveyed in a speech signal using machines such as computers and mobile devices. SER has remained popular among researchers for three decades, largely because of its broad practical applicability: it has proven beneficial in fields such as medical intervention, safety and surveillance, online search engines, road safety, customer relationship management, human-machine interaction, and numerous other domains. To improve emotion classification, researchers have employed diverse methodologies, such as combining different features, applying feature selection techniques, and designing hybrid models that use more than one classifier. In our study, we used a novel technique, the fixed frequency range empirical wavelet transform (FFREWT) filter bank decomposition method, to extract features, and then used those features to identify the emotion in the speech signal. The FFREWT filter bank method decomposes each speech signal frame (SSF) into several sub-signals, or modes. From each FFREWT-based mode, we extracted features such as the mel-frequency cepstral coefficients (MFCC), approximate entropy (ApEn), permutation entropy (PrEn), and increment entropy (IrEn). We then used different combinations of the proposed FFREWT-based feature sets with a deep neural network (DNN) classifier to classify speech emotion. The proposed method achieves emotion classification accuracies of 89.35%, 84.69%, and 100% using the combined FFREWT-based features (MFCC + ApEn + PrEn + IrEn) on the EMO-DB, EMOVO, and TESS datasets, respectively. Comparing our experimental results with other methods, we found that the proposed FFREWT-based feature combinations with a DNN classifier outperform state-of-the-art SER methods.
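The abstract describes FFREWT as a filter bank that splits each speech frame into modes occupying fixed frequency ranges. As a rough illustration of that idea only (not the paper's actual filters: the band edges in `boundaries_hz` below are hypothetical, and FFREWT uses smooth empirical wavelet filters rather than the ideal spectral masks used here), a frame can be decomposed by masking its FFT spectrum band by band:

```python
import numpy as np

def fixed_band_decompose(frame, sr, boundaries_hz):
    """Split a frame into sub-signals (modes), one per fixed
    frequency band.  Ideal (brick-wall) spectral masks are used
    for simplicity; FFREWT's wavelet filters are smoother."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges = [0.0] + list(boundaries_hz) + [sr / 2.0]
    modes = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == len(edges) - 2:
            # last band includes the Nyquist bin so that the
            # bands partition the whole spectrum
            mask = (freqs >= lo) & (freqs <= hi)
        else:
            mask = (freqs >= lo) & (freqs < hi)
        modes.append(np.fft.irfft(spectrum * mask, n=len(frame)))
    return modes
```

Because the masks partition the spectrum, the modes sum back to the original frame; features such as MFCCs or the entropies listed above would then be computed per mode.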
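Two of the entropy features named in the abstract, permutation entropy (PrEn) and approximate entropy (ApEn), have standard textbook definitions. A minimal NumPy sketch of both follows; the parameter choices (`order`, `delay`, `m`, `r`) are illustrative defaults, not the paper's settings:

```python
import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1):
    """PrEn: normalized Shannon entropy of ordinal patterns.
    0 for a monotonic signal, approaching 1 when ordinal
    patterns are uniformly distributed."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    patterns = np.array([np.argsort(x[i:i + (order - 1) * delay + 1:delay])
                         for i in range(n)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)) / np.log2(factorial(order)))

def approximate_entropy(x, m=2, r=0.2):
    """ApEn: regularity measure; 0 for perfectly regular data,
    larger for less predictable signals."""
    x = np.asarray(x, dtype=float)

    def phi(m):
        emb = np.array([x[i:i + m] for i in range(len(x) - m + 1)])
        # fraction of template pairs within tolerance r
        # (Chebyshev distance between m-length templates)
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        c = np.mean(dist <= r, axis=0)
        return np.mean(np.log(c))

    return float(phi(m) - phi(m + 1))
```

In the pipeline described above, such values would be computed per FFREWT mode and concatenated with the MFCCs to form the feature vector fed to the DNN classifier.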
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.