Efficient bimodal emotion recognition system based on speech/text embeddings and ensemble learning fusion

Impact Factor 2.2 · CAS Tier 4 (Computer Science) · JCR Q3 (Telecommunications)
Adil Chakhtouna, Sara Sekkate, Abdellah Adib
Annals of Telecommunications, vol. 80, pp. 379–399 (published 2025-04-01)
DOI: 10.1007/s12243-025-01088-y (https://link.springer.com/article/10.1007/s12243-025-01088-y)
Citations: 0

Abstract

Emotion recognition (ER) is a pivotal discipline in contemporary human–machine interaction. Its primary objective is to explore and advance theories, systems, and methodologies that can effectively recognize, comprehend, and interpret human emotions. This research investigates both unimodal and bimodal strategies for ER using advanced feature embeddings for audio and text data. We leverage pretrained models, ImageBind for speech and RoBERTa for text, alongside traditional TF-IDF embeddings, to achieve accurate recognition of emotional states. A variety of machine learning (ML) and deep learning (DL) algorithms were implemented to evaluate their performance in speaker-dependent (SD) and speaker-independent (SI) scenarios. Additionally, three fusion methods (early fusion, majority voting fusion, and stacking ensemble fusion) were employed for the bimodal emotion recognition (BER) task. Extensive numerical simulations were conducted to systematically address the complexities and challenges associated with both unimodal and bimodal ER. Our most notable results are an accuracy of 86.75% in the SD scenario and 64.04% in the SI scenario on the IEMOCAP database for the proposed BER system.
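The paper itself does not publish code, but the three fusion strategies named in the abstract can be sketched with scikit-learn. In this hypothetical sketch, the audio and text embeddings are replaced by random stand-in matrices (the real system would use ImageBind and RoBERTa/TF-IDF features), and the base and meta learners are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for precomputed embeddings: audio vectors (e.g., from ImageBind)
# and text vectors (e.g., TF-IDF) for the same 200 utterances, 4 emotion classes.
n = 200
X_audio = rng.normal(size=(n, 64))
X_text = rng.normal(size=(n, 32))
y = rng.integers(0, 4, size=n)

# Early fusion: concatenate modality features before classification.
X_early = np.hstack([X_audio, X_text])

base_learners = [
    ("svm", SVC(probability=True, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
]

# Majority voting fusion: each base learner votes; the majority label wins.
vote = VotingClassifier(estimators=base_learners, voting="hard")
vote.fit(X_early, y)

# Stacking ensemble fusion: base learners are trained on the fused features,
# and a logistic-regression meta-learner combines their predicted probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_early, y)
preds = stack.predict(X_early)
print(preds.shape)  # (200,)
```

In a real pipeline the early-fusion concatenation would happen before any classifier, whereas voting and stacking operate on classifier outputs; separate base learners could also be trained per modality rather than on the concatenated features.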

Source journal

Annals of Telecommunications (Engineering & Technology – Telecommunications)

CiteScore: 5.20
Self-citation rate: 5.30%
Articles per year: 37
Review time: 4.5 months
About the journal: Annals of Telecommunications is an international journal publishing original peer-reviewed papers in the field of telecommunications. It covers all the essential branches of modern telecommunications, ranging from digital communications to communication networks and the internet, to software, protocols and services, uses and economics. This broad spectrum of topics reflects the rapid convergence, through telecommunications, of the underlying technologies in computers, communications, and content management toward the emergence of the information and knowledge society. As a consequence, the journal provides a medium for exchanging research results and technological achievements accomplished by the European and international scientific community, from academia and industry.