Reproducible and generalizable speech emotion recognition via an Intelligent Fusion Network

Impact Factor: 4.9 · CAS Tier 2 (Medicine) · JCR Q1 (Engineering, Biomedical)
Huiyun Zhang, Puyang Zhao, Gaigai Tang, Zongjin Li, Zhu Yuan
DOI: 10.1016/j.bspc.2025.107996
Journal: Biomedical Signal Processing and Control, Volume 109, Article 107996
Publication date: 2025-05-08 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S1746809425005075
Citations: 0

Abstract

Speech emotion recognition (SER) is critical to making human-computer interaction (HCI) systems more natural and effective. Despite substantial progress driven by deep learning, challenges remain in model performance, reproducibility, and generalization. To address these challenges, we propose the Intelligent Fusion Network (IFN), a novel framework designed to improve emotion recognition by leveraging an isomorphic architecture and attention-based mechanisms. The IFN framework consists of five key components: an input processing layer, a feature mapping module, a dual attention mechanism, a convolutional feature refinement module, and a multiplicative fusion module, culminating in an output layer. In addition, we introduce a robust methodology for quantifying and assessing the reproducibility of deep learning models, ensuring consistent and reliable evaluation. Extensive experiments across six benchmark datasets (EMODB, CASIA, SAVEE, BodEMODB, IEMOCAP, and ESD) demonstrate the superior performance of IFN. Specifically, IFN achieves 96.31% accuracy on the ESD dataset, surpassing the leading baseline by 2.70%. On the more challenging IEMOCAP dataset, IFN attains 64.32% accuracy, highlighting its ability to generalize effectively across diverse datasets. Furthermore, IFN demonstrates exceptional reproducibility, with general and correct reproducibility rates of 86.69% and 86.34%, respectively, at k = 10 on the ESD dataset, significantly outperforming existing approaches. These results establish IFN as a reliable and effective approach to advancing SER, with the potential to enable more intuitive and efficient HCI systems.
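The five-module pipeline described in the abstract can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the reading of "dual attention" as two independently parameterized attention heads, the element-wise product as the "multiplicative fusion," and all shapes and names (`attention_pool`, `T`, `D`) are assumptions made purely for illustration.

```python
import numpy as np

def attention_pool(feats, w):
    """Score each frame with a weight vector w, then pool frames
    by their softmax attention weights into one utterance vector."""
    scores = feats @ w                        # (T,) one score per frame
    alpha = np.exp(scores - scores.max())     # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ feats                      # (D,) weighted frame average

rng = np.random.default_rng(0)
T, D = 50, 16                                 # frames x feature dimension
feats = rng.standard_normal((T, D))           # stand-in for mapped features

# Dual attention read here as two independently parameterized heads
h1 = attention_pool(feats, rng.standard_normal(D))
h2 = attention_pool(feats, rng.standard_normal(D))

fused = h1 * h2                               # multiplicative fusion
logits = fused @ rng.standard_normal((D, 4))  # 4 hypothetical emotion classes
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # output layer (softmax)
```

The element-wise product lets one branch gate the other, which is one common motivation for multiplicative rather than additive fusion; whether IFN uses exactly this form is specified only in the full paper.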
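The abstract reports "general" and "correct" reproducibility rates at k = 10 without defining them; the exact definitions are in the full text. One plausible reading — a sample is generally reproducible if all k repeated runs predict the same label, and correctly reproducible if that shared label also matches the ground truth — can be sketched as follows (the function name and both definitions are assumptions):

```python
def reproducibility_rates(runs, labels):
    """Estimate reproducibility over k repeated train/evaluate runs.

    runs:   list of k prediction lists, one per run, aligned by sample
    labels: ground-truth labels, same length as each prediction list

    Returns (general_rate, correct_rate) under the assumed definitions:
    general = fraction of samples predicted identically across all k runs,
    correct = fraction predicted identically AND matching the label.
    """
    k, n = len(runs), len(labels)
    general = correct = 0
    for i in range(n):
        preds = {runs[r][i] for r in range(k)}
        if len(preds) == 1:                   # same prediction in every run
            general += 1
            if preds.pop() == labels[i]:      # and it is the right one
                correct += 1
    return general / n, correct / n

# Toy example: 3 runs over 4 samples
g, c = reproducibility_rates(
    [[0, 1, 1, 2], [0, 1, 0, 2], [0, 1, 1, 2]],
    [0, 1, 1, 0],
)
```

Under these definitions correct ≤ general always holds, matching the ordering of the reported 86.69% and 86.34% figures.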
Source journal: Biomedical Signal Processing and Control (Engineering, Biomedical)
CiteScore: 9.80
Self-citation rate: 13.70%
Articles per year: 822
Review time: 4 months
Journal description: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of engineering and clinical science. The scope of the journal includes relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.