Enhanced speech emotion understanding using advanced attention-centric convolutional networks

IF 4.9 | Zone 2 (Medicine) | Q1 ENGINEERING, BIOMEDICAL

Biomedical Signal Processing and Control | Pub Date: 2025-05-03 | DOI: 10.1016/j.bspc.2025.107936

Yingmei Qi, Heming Huang, Huiyun Zhang

Volume 108, Article 107936. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S1746809425004471

Citations: 0

Abstract

Speech Emotion Recognition (SER) plays a crucial role in Human-Computer Interaction (HCI) systems, enabling machines to understand and respond to human emotional states. This paper presents a framework that combines feature fusion with a deep learning architecture for robust SER. The proposed model integrates multiple features, including Mel-frequency cepstral coefficients (MFCCs), zero-crossing rate (ZCR), and chroma. These features are augmented with statistical summaries of the MFCCs (mean, maximum, and minimum values), which enhances the discriminative power of the input representation. The proposed architecture, Advanced Attention-Centric Convolutional Networks (AACCN), is a hybrid that combines Multi-Head Attention (MHA) with Convolutional Neural Networks (CNNs). MHA captures intricate dependencies within the input sequences, while the CNNs provide hierarchical feature learning and spatial modeling of temporal sequences. Batch normalization and dropout are applied to improve generalization and mitigate overfitting. Experimental results on benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance in SER tasks, with significant improvements in accuracy, precision, recall, and F1-score over baseline models. The effectiveness of feature fusion and the synergy between MHA and CNNs highlight the robustness and scalability of the AACCN model across diverse emotional contexts in speech signals.
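
The feature-fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, frame length, and hop size are hypothetical, and the per-frame coefficient matrix is a stand-in for MFCCs that would normally come from an audio library. The sketch only shows the general idea of computing a zero-crossing rate per frame, summarizing per-frame coefficients by mean, maximum, and minimum, and concatenating everything into one input vector.

```python
def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def summarize(coeffs_by_frame):
    """Per-coefficient mean, max, and min across frames, concatenated in that order."""
    n_frames = len(coeffs_by_frame)
    per_coeff = list(zip(*coeffs_by_frame))  # transpose: one tuple per coefficient
    means = [sum(c) / n_frames for c in per_coeff]
    maxes = [max(c) for c in per_coeff]
    mins = [min(c) for c in per_coeff]
    return means + maxes + mins


def fuse_features(signal, coeffs_by_frame, frame_len=4, hop=2):
    """Concatenate the mean ZCR with statistical summaries of per-frame coefficients.

    coeffs_by_frame is a placeholder for MFCCs (assumption: one list of
    coefficients per frame, computed elsewhere); chroma features could be
    appended in the same way.
    """
    frames = frame_signal(signal, frame_len, hop)
    zcr = [zero_crossing_rate(f) for f in frames]
    mean_zcr = sum(zcr) / len(zcr)
    return [mean_zcr] + summarize(coeffs_by_frame)


# Toy input: an alternating signal and three frames of two made-up coefficients.
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
coeffs = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
print(fuse_features(signal, coeffs))
```

With C coefficients per frame, the fused vector in this sketch has 1 + 3C entries. In a real pipeline the vector would then be fed to the attention and convolutional layers, whose exact configuration the abstract does not specify.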

Source journal

Biomedical Signal Processing and Control (Engineering & Technology, Engineering: Biomedical)

CiteScore: 9.80

Self-citation rate: 13.70%

Articles published: 822

Review time: 4 months

Journal description: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.