{"title":"多模态情绪识别:整合语音和文本以改善效价、觉醒和优势预测","authors":"Messaoudi Awatef, Boughrara Hayet, Lachiri Zied","doi":"10.1007/s12243-025-01069-1","DOIUrl":null,"url":null,"abstract":"<div><p>While speech emotion recognition has traditionally focused on classifying emotions into discrete categories like happy or angry, recent research has shifted towards a dimensional approach using the Valence-Arousal-Dominance model. This model captures the continuous emotional state. However, research in speech emotion recognition (SER) consistently shows lower performance in predicting valence compared to arousal and dominance. To improve performance, we propose a system that combines acoustic and linguistic information. This work explores a novel multimodal approach for emotion recognition that combines speech and text data. This fusion strategy aims to outperform the traditional single-modality systems. Both early and late fusion techniques are investigated in this paper. Our findings show that combining modalities in a late fusion approach enhances system performance. In this late fusion architecture, the outputs from the acoustic deep learning network and the linguistic network are fed into two stacked dense neural network (NN) layers to predict valence, arousal, and dominance as continuous values. 
This approach leads to a significant improvement in overall emotion recognition performance compared to prior methods.</p></div>","PeriodicalId":50761,"journal":{"name":"Annals of Telecommunications","volume":"80 and networking","pages":"401 - 415"},"PeriodicalIF":2.2000,"publicationDate":"2025-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal emotion recognition: integrating speech and text for improved valence, arousal, and dominance prediction\",\"authors\":\"Messaoudi Awatef, Boughrara Hayet, Lachiri Zied\",\"doi\":\"10.1007/s12243-025-01069-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>While speech emotion recognition has traditionally focused on classifying emotions into discrete categories like happy or angry, recent research has shifted towards a dimensional approach using the Valence-Arousal-Dominance model. This model captures the continuous emotional state. However, research in speech emotion recognition (SER) consistently shows lower performance in predicting valence compared to arousal and dominance. To improve performance, we propose a system that combines acoustic and linguistic information. This work explores a novel multimodal approach for emotion recognition that combines speech and text data. This fusion strategy aims to outperform the traditional single-modality systems. Both early and late fusion techniques are investigated in this paper. Our findings show that combining modalities in a late fusion approach enhances system performance. In this late fusion architecture, the outputs from the acoustic deep learning network and the linguistic network are fed into two stacked dense neural network (NN) layers to predict valence, arousal, and dominance as continuous values. 
This approach leads to a significant improvement in overall emotion recognition performance compared to prior methods.</p></div>\",\"PeriodicalId\":50761,\"journal\":{\"name\":\"Annals of Telecommunications\",\"volume\":\"80 and networking\",\"pages\":\"401 - 415\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2025-02-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annals of Telecommunications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s12243-025-01069-1\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"TELECOMMUNICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Telecommunications","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s12243-025-01069-1","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
Multimodal emotion recognition: integrating speech and text for improved valence, arousal, and dominance prediction
While speech emotion recognition (SER) has traditionally focused on classifying emotions into discrete categories such as happy or angry, recent research has shifted towards a dimensional approach using the Valence-Arousal-Dominance (VAD) model, which represents emotional states as continuous values. However, SER research consistently shows lower performance in predicting valence than in predicting arousal and dominance. To close this gap, we propose a system that combines acoustic and linguistic information, exploring a novel multimodal approach to emotion recognition that fuses speech and text data. This fusion strategy aims to outperform traditional single-modality systems. Both early and late fusion techniques are investigated in this paper, and our findings show that combining modalities in a late fusion approach enhances system performance. In this late fusion architecture, the outputs of the acoustic deep learning network and the linguistic network are fed into two stacked dense neural network (NN) layers that predict valence, arousal, and dominance as continuous values. This approach yields a significant improvement in overall emotion recognition performance compared to prior methods.
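The late-fusion head described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the embedding dimensions, hidden size, random weights, and plain NumPy dense layers are all assumptions; only the overall shape (concatenate the two modality outputs, pass them through two stacked dense layers, regress three continuous VAD values) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (not stated in the abstract): each modality network
# is taken to emit a 128-dim embedding; the fusion hidden layer is 64 units.
ACOUSTIC_DIM, LINGUISTIC_DIM, HIDDEN_DIM = 128, 128, 64

# Randomly initialized weights stand in for the trained parameters.
w1 = rng.standard_normal((ACOUSTIC_DIM + LINGUISTIC_DIM, HIDDEN_DIM)) * 0.01
b1 = np.zeros(HIDDEN_DIM)
w2 = rng.standard_normal((HIDDEN_DIM, 3)) * 0.01  # 3 outputs: V, A, D
b2 = np.zeros(3)

def late_fusion_head(acoustic_out, linguistic_out):
    """Concatenate modality embeddings, then apply two stacked dense layers
    to predict valence, arousal, and dominance as continuous values."""
    fused = np.concatenate([acoustic_out, linguistic_out], axis=-1)
    hidden = np.maximum(fused @ w1 + b1, 0.0)  # dense layer 1 + ReLU
    return hidden @ w2 + b2                    # dense layer 2 -> (V, A, D)

# Example: a batch of 4 utterances with precomputed modality embeddings.
vad = late_fusion_head(rng.standard_normal((4, ACOUSTIC_DIM)),
                       rng.standard_normal((4, LINGUISTIC_DIM)))
print(vad.shape)  # (4, 3): one (valence, arousal, dominance) triple per utterance
```

The design choice here mirrors the paper's claim that late fusion wins: each modality is encoded independently, and only the compact network outputs are combined, so neither modality's noise dominates the shared representation.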
About the journal:
Annals of Telecommunications is an international journal publishing original peer-reviewed papers in the field of telecommunications. It covers all the essential branches of modern telecommunications, ranging from digital communications to communication networks and the internet, to software, protocols and services, uses and economics. This large spectrum of topics reflects the rapid convergence, through telecommunications, of the underlying technologies in computing, communications, and content management, towards the emergence of the information and knowledge society. As a consequence, the journal provides a medium for exchanging research results and technological achievements accomplished by the European and international scientific community, from academia and industry.