{"title":"Reproducible and generalizable speech emotion recognition via an Intelligent Fusion Network","authors":"Huiyun Zhang, Puyang Zhao, Gaigai Tang, Zongjin Li, Zhu Yuan","doi":"10.1016/j.bspc.2025.107996","DOIUrl":null,"url":null,"abstract":"<div><div>Speech emotion recognition (SER) is a critical aspect of enhancing the naturalness and effectiveness of human-computer interaction systems. Despite substantial progress through deep learning techniques, challenges remain, particularly concerning model performance, reproducibility, and generalization. To address these challenges, we propose the Intelligent Fusion Network (IFN), a novel framework designed to improve emotion recognition by leveraging an isomorphic architecture and attention-based mechanisms. The IFN framework consists of five key components: an input processing layer, a feature mapping module, a dual attention mechanism, a convolutional feature refinement module, and a multiplicative fusion module, culminating in an output layer. In addition, we introduce a robust methodology for quantifying and assessing the reproducibility of deep learning models, ensuring consistent and reliable evaluations. Extensive experiments conducted across six benchmark datasets—EMODB, CASIA, SAVEE, BodEMODB, IEMOCAP, and ESD—demonstrate the superior performance of IFN. Specifically, IFN achieves an accuracy of 96.31 % on the ESD dataset, surpassing the leading baseline by 2.70 %. On the more challenging IEMOCAP dataset, IFN attains an accuracy of 64.32 %, highlighting its ability to generalize effectively across diverse datasets. Furthermore, IFN demonstrates exceptional reproducibility, with general and correct reproducibility rates of 86.69 % and 86.34 %, respectively, at k = 10 on the ESD dataset, significantly outperforming existing approaches. These results highlight IFN as a highly reliable and effective solution for advancing SER, offering the potential to enable more intuitive and efficient Human-Computer Interaction (HCI) systems.</div></div>","PeriodicalId":55362,"journal":{"name":"Biomedical Signal Processing and Control","volume":"109 ","pages":"Article 107996"},"PeriodicalIF":4.9000,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biomedical Signal Processing and Control","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1746809425005075","RegionNum":2,"RegionCategory":"Medicine","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
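The five-component pipeline named in the abstract (input processing, feature mapping, dual attention, convolutional feature refinement, multiplicative fusion, output layer) can be sketched roughly as follows. All shapes, the attention formulation, and the specific operations are assumptions for illustration only; the record does not describe the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w):
    # Simple attention-style pooling over time steps (assumed form).
    scores = softmax(x @ w, axis=0)   # (T, 1) weights over frames
    return (x * scores).sum(axis=0)   # (D,) pooled feature vector

# Hypothetical dimensions: T time frames, D features per frame, C emotion classes.
T, D, C = 100, 40, 6
x = rng.standard_normal((T, D))       # input processing layer output (e.g. spectral frames)

w_map = rng.standard_normal((D, D))
h = np.tanh(x @ w_map)                # feature mapping module

# Dual attention: two parallel attention branches over the mapped features.
branch1 = attention(h, rng.standard_normal((D, 1)))
branch2 = attention(h, rng.standard_normal((D, 1)))

# Convolutional feature refinement stand-in: 1-D smoothing filter per branch.
kernel = np.ones(3) / 3
refine = lambda v: np.convolve(v, kernel, mode="same")
r1, r2 = refine(branch1), refine(branch2)

# Multiplicative fusion: element-wise product of the refined branches.
fused = r1 * r2

# Output layer: linear classifier with softmax over emotion classes.
w_out = rng.standard_normal((D, C))
probs = softmax(fused @ w_out)
print(probs.shape)
```

The multiplicative fusion step is what distinguishes this sketch from the more common additive or concatenative fusion; the element-wise product lets one branch gate the other.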
Citations: 0
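The reproducibility evaluation mentioned in the abstract (general and correct reproducibility rates at k = 10) can be illustrated with a small sketch. The definitions used below are an assumed reading, not the paper's: "general" reproducibility counts test samples predicted identically across all k independent training runs, and "correct" reproducibility additionally requires that shared prediction to match the true label.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictions from k = 10 independent training runs on one test set.
k, n_samples, n_classes = 10, 200, 6
true_labels = rng.integers(0, n_classes, n_samples)

# Simulate runs that agree with a reference prediction ~90% of the time.
reference = rng.integers(0, n_classes, n_samples)
noise = rng.integers(0, n_classes, (k, n_samples))
runs = np.where(rng.random((k, n_samples)) < 0.9, reference, noise)

# General reproducibility: fraction of samples predicted identically by all k runs.
identical = (runs == runs[0]).all(axis=0)
general_rate = identical.mean()

# Correct reproducibility: identical across runs AND equal to the true label.
correct_rate = (identical & (runs[0] == true_labels)).mean()

print(f"general={general_rate:.2%}, correct={correct_rate:.2%}")
```

By construction the correct rate can never exceed the general rate, which matches the ordering of the reported figures (86.69 % vs. 86.34 %).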
About the journal:
Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with the practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management.
Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.