{"title":"利用门控跨模态注意和多模态同质特征差异学习改进语音情感识别","authors":"Feng Li , Jiusong Luo , Wanjun Xia","doi":"10.1016/j.asoc.2025.113915","DOIUrl":null,"url":null,"abstract":"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"185 ","pages":"Article 113915"},"PeriodicalIF":6.6000,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning\",\"authors\":\"Feng Li , Jiusong Luo , Wanjun Xia\",\"doi\":\"10.1016/j.asoc.2025.113915\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. 
WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"185 \",\"pages\":\"Article 113915\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2025-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625012281\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625012281","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning
Speech emotion recognition (SER) remains a significant challenge due to the complex and multifaceted nature of human emotions. To address it, researchers integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework for SER that tackles three key research challenges: effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By combining a gated cross-modal attention mechanism with multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our results highlight the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. On the IEMOCAP and MELD datasets, WavFusion improves accuracy by 0.78% and 1.27%, and weighted F1 score by 0.74% and 0.44%, respectively, over previous methods.
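The abstract names two mechanisms but the listing carries no code, so the following is only a minimal PyTorch sketch of what a gated cross-modal attention block and a homogeneous-feature discrepancy loss could look like. All class names, dimensions, the sigmoid gating over a residual connection, the mean-pooling, and the cosine-based loss are assumptions for illustration, not the authors' WavFusion implementation.

# Generic sketch of (1) gated cross-modal attention fusing audio and text
# features and (2) a homogeneous-feature discrepancy loss. Illustrative
# only; names, shapes, and formulations are assumptions, not WavFusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCrossModalAttention(nn.Module):
    """Audio queries attend over text; a learned gate controls how much
    attended text information is mixed back into the audio stream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from concatenated audio and attended-text features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, dim), text: (B, Tt, dim)
        attended, _ = self.attn(query=audio, key=text, value=text)
        g = self.gate(torch.cat([audio, attended], dim=-1))  # (B, Ta, dim)
        return self.norm(audio + g * attended)  # gated residual fusion


def homogeneous_discrepancy_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """One plausible 'feature discrepancy' objective: pull the two modality
    representations of the same utterance together (mean-pooled, then
    L2-normalized) so their shared, homogeneous parts agree. The paper may
    define this differently."""
    a = F.normalize(a.mean(dim=1), dim=-1)  # (B, dim)
    b = F.normalize(b.mean(dim=1), dim=-1)  # (B, dim)
    return (1 - F.cosine_similarity(a, b, dim=-1)).mean()


if __name__ == "__main__":
    fusion = GatedCrossModalAttention(dim=256)
    audio = torch.randn(8, 120, 256)  # e.g. frame-level acoustic features
    text = torch.randn(8, 40, 256)    # e.g. token embeddings
    fused = fusion(audio, text)
    loss = homogeneous_discrepancy_loss(fused, text)
    print(fused.shape, loss.item())

The design intuition behind such a gate is that text is not equally informative at every audio frame; the sigmoid lets the model suppress the attended features where the acoustic signal alone should dominate.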
Journal introduction:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. Its focus is on publishing the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore updated continuously with new articles, and publication times are short.