{"title":"利用门控跨模态注意和多模态同质特征差异学习改进语音情感识别","authors":"Feng Li , Jiusong Luo , Wanjun Xia","doi":"10.1016/j.asoc.2025.113915","DOIUrl":null,"url":null,"abstract":"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>","PeriodicalId":50737,"journal":{"name":"Applied Soft Computing","volume":"185 ","pages":"Article 113915"},"PeriodicalIF":6.6000,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning\",\"authors\":\"Feng Li , Jiusong Luo , Wanjun Xia\",\"doi\":\"10.1016/j.asoc.2025.113915\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Speech emotion recognition (SER) remains a significant and crucial challenge due to the complex and multifaceted nature of human emotions. To tackle this challenge, researchers strive to integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between different modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework designed for SER that tackles key research challenges, such as effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By utilizing a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our research highlights the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results indicate that the proposed method is highly competitive and better than most of the latest state-of-the-art methods for SER. 
WavFusion achieves 0.78 % and 1.27 % improvement in accuracy and 0.74 % and 0.44 % improvement in weighted F1 score over the previous methods on the IEMOCAP and MELD datasets, respectively.</div></div>\",\"PeriodicalId\":50737,\"journal\":{\"name\":\"Applied Soft Computing\",\"volume\":\"185 \",\"pages\":\"Article 113915\"},\"PeriodicalIF\":6.6000,\"publicationDate\":\"2025-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Soft Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1568494625012281\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Soft Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1568494625012281","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Improving speech emotion recognition using gated cross-modal attention and multimodal homogeneous feature discrepancy learning
Speech emotion recognition (SER) remains a significant challenge due to the complex and multifaceted nature of human emotions. To address it, researchers integrate information from diverse modalities through multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of interactions between modalities, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal framework for SER that tackles three key research challenges: effective multimodal fusion, modality heterogeneity, and discriminative representation learning. By combining a gated cross-modal attention mechanism with multimodal homogeneous feature discrepancy learning, WavFusion outperforms existing state-of-the-art methods on benchmark datasets. Our results highlight the importance of capturing subtle cross-modal interactions and learning discriminative representations for accurate multimodal SER. On the IEMOCAP and MELD datasets, WavFusion improves accuracy by 0.78% and 1.27%, and weighted F1 score by 0.74% and 0.44%, respectively, over previous methods.
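The abstract names two mechanisms but the listing carries no code, so the following is only a minimal PyTorch sketch of what a gated cross-modal attention block and a homogeneous-feature discrepancy loss could look like. All class names, dimensions, the sigmoid gating over a residual connection, the mean-pooling, and the cosine-based loss are assumptions for illustration, not the authors' WavFusion implementation.

# Generic sketch of (1) gated cross-modal attention fusing audio and text
# features and (2) a homogeneous-feature discrepancy loss. Illustrative
# only; names, shapes, and formulations are assumptions, not WavFusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCrossModalAttention(nn.Module):
    """Audio queries attend over text; a learned gate controls how much
    attended text information is mixed back into the audio stream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate computed from concatenated audio and attended-text features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, dim), text: (B, Tt, dim)
        attended, _ = self.attn(query=audio, key=text, value=text)
        g = self.gate(torch.cat([audio, attended], dim=-1))  # (B, Ta, dim)
        return self.norm(audio + g * attended)  # gated residual fusion


def homogeneous_discrepancy_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """One plausible 'feature discrepancy' objective: pull the two modality
    representations of the same utterance together (mean-pooled, then
    L2-normalized) so their shared, homogeneous parts agree. The paper may
    define this differently."""
    a = F.normalize(a.mean(dim=1), dim=-1)  # (B, dim)
    b = F.normalize(b.mean(dim=1), dim=-1)  # (B, dim)
    return (1 - F.cosine_similarity(a, b, dim=-1)).mean()


if __name__ == "__main__":
    fusion = GatedCrossModalAttention(dim=256)
    audio = torch.randn(8, 120, 256)  # e.g. frame-level acoustic features
    text = torch.randn(8, 40, 256)    # e.g. token embeddings
    fused = fusion(audio, text)
    loss = homogeneous_discrepancy_loss(fused, text)
    print(fused.shape, loss.item())

The design intuition behind such a gate is that text is not equally informative at every audio frame; the sigmoid lets the model suppress the attended features where the acoustic signal alone should dominate.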
Journal introduction:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. Its focus is on publishing the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore updated continuously with new articles, and publication times are short.