Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning

IF 6.6 | CAS Region 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Feng Li, Jiusong Luo, Wanjun Xia
{"title":"利用门控跨模态注意和多模态同质特征差异学习改进语音情感识别","authors":"Feng Li ,&nbsp;Jiusong Luo ,&nbsp;Wanjun Xia","doi":"10.1016/j.asoc.2025.113915","DOIUrl":null,"url":null,"abstract":"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"185 ","pages":"Article 113915"},"PeriodicalIF":6.6000,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning\",\"authors\":\"Feng Li ,&nbsp;Jiusong Luo ,&nbsp;Wanjun Xia\",\"doi\":\"10.1016/j.asoc.2025.113915\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. 
WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"185 \",\"pages\":\"Article 113915\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2025-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625012281\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625012281","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.
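Since the abstract names gated cross-modal attention as the core fusion mechanism, the short Python (PyTorch) sketch below shows one common way such a block can be assembled: one modality attends to the other, and a learned gate decides how much of the attended signal to admit. This is an illustration only, not the WavFusion implementation; the layer sizes, the sigmoid gate over concatenated features, and the residual fusion are assumptions made for clarity.

# A minimal, generic sketch of a gated cross-modal attention block.
# NOT the authors' architecture: dimensions, gating, and fusion order are assumed.
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One modality (e.g., speech) attends to the other (e.g., text).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A learned gate decides how much cross-modal information to admit.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, other_mod: torch.Tensor) -> torch.Tensor:
        # query_mod: (batch, T_q, dim); other_mod: (batch, T_k, dim)
        attended, _ = self.cross_attn(query_mod, other_mod, other_mod)
        g = self.gate(torch.cat([query_mod, attended], dim=-1))
        # Gated residual fusion: keep the original modality where the gate is low.
        return self.norm(g * attended + (1.0 - g) * query_mod)

# Example: fuse 100 speech frames with 20 text tokens, both projected to 256-d.
speech = torch.randn(2, 100, 256)
text = torch.randn(2, 20, 256)
fused = GatedCrossModalAttention()(speech, text)  # -> shape (2, 100, 256)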
Source journal
Applied Soft Computing
Category: Engineering & Technology, Computer Science: Interdisciplinary Applications
CiteScore: 15.80
Self-citation rate: 6.90%
Articles published: 874
Review time: 10.9 months
Journal description: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website is continuously updated with new articles and the publication time is short.