Speech emotion recognition based on spiking neural network and convolutional neural network

Impact factor: 8.0; Zone 2, Computer Science; JCR Q1, Automation & Control Systems

Engineering Applications of Artificial Intelligence, vol. 147, Article 110314 | Pub Date: 2025-02-22 | DOI: 10.1016/j.engappai.2025.110314

Chengyan Du, Fu Liu, Bing Kang, Tao Hou


Citations: 0

Abstract


Speech emotion recognition based on spiking neural network and convolutional neural network

There is an urgent need to determine emotions automatically from speech signals to advance intelligent technology. However, the problem of low accuracy remains unsolved, which hinders potential applications of Speech Emotion Recognition (SER). One of the most critical reasons for this low accuracy is that subjective emotions are random and generate weak pulse signals; moreover, they are often hidden in the audio, video, and text features extracted from speech. Hence, these features may not be discriminative enough to depict subjective emotions. Therefore, a dual-path SER framework is designed in this paper. Alongside the traditional Convolutional Neural Network (CNN)-based SER scheme that handles speech emotion features, a Spiking Neural Network (SNN) framework is added to identify the dynamic pulse emotion features and improve the accuracy of SER. At the same time, a Perceptual Neuron Encoding Layer (PNEL) is proposed to enhance the ability to process speech signals. Overall, experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed approach achieves 65.3% accuracy and performs well on the SER task compared with other existing approaches.
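The abstract does not give the internals of the SNN path or the PNEL, so the following is only a generic sketch of how a spiking path might turn a feature sequence into spikes and how its output could be combined with CNN features. It uses a textbook leaky integrate-and-fire (LIF) neuron, not the authors' method; the function names `lif_encode`, `firing_rate`, and `fuse`, the parameters `threshold` and `decay`, and the choice of concatenation as the fusion step are all assumptions made for illustration.

```python
from typing import List, Sequence


def lif_encode(
    inputs: Sequence[float],
    threshold: float = 1.0,
    decay: float = 0.9,
) -> List[int]:
    """Convert a sequence of input currents into a binary spike train.

    Generic leaky integrate-and-fire (LIF) neuron. The parameter names and
    defaults are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    membrane = 0.0
    spikes: List[int] = []
    for current in inputs:
        # Leak part of the previous potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


def firing_rate(spikes: Sequence[int]) -> float:
    """Fraction of time steps on which the neuron fired."""
    return sum(spikes) / len(spikes) if spikes else 0.0


def fuse(cnn_features: Sequence[float], snn_rates: Sequence[float]) -> List[float]:
    """Concatenate the two paths' feature vectors (one simple fusion choice)."""
    return list(cnn_features) + list(snn_rates)
```

For example, `lif_encode([0.5, 0.5, 0.5, 0.5], decay=1.0)` returns `[0, 1, 0, 1]`: with no leak, the potential reaches the threshold on every second step. A weak input needs several steps to accumulate before it produces a spike, which is one way to read the abstract's remark about weak pulse signals.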


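Source journal

Engineering Applications of Artificial Intelligence (Engineering & Technology; Engineering, Electrical & Electronic)

CiteScore: 9.60

Self-citation rate: 10.00%

Articles published: 505

Review time: 68 days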

Journal description: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.