Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review

IF 3.4 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language · Pub Date: 2025-09-01 · DOI: 10.1016/j.csl.2025.101873

Andrea Chaves-Villota, Ana Jimenez-Martín, Mario Jojoa-Acosta, Alfonso Bahillo, Juan Jesús García-Domínguez

Volume 96, Article 101873. Full text: https://www.sciencedirect.com/science/article/pii/S0885230825000981

Citations: 0

Abstract

Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on several key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. We thus include journal and conference papers that provide essential insights into the key parameters required for dataset development: emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the most common modalities integrated with acoustic and linguistic data from speech, and the technologies used. Following the same methodology, we identified the key representative features that serve as critical sources of emotional information in both modalities. For the acoustic modality, these included features extracted from the time and frequency domains, while for the linguistic modality, earlier embeddings and the most common transformer models were considered. In addition, Deep Learning (DL) and attention-based methods were analyzed for both modalities. Given the importance of effectively combining these diverse features for improving ER, we then explore fusion techniques based on the level of abstraction. Specifically, we focus on traditional approaches, including feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis to assess the performance of the approaches included in our study. Our findings indicate that for the most commonly used datasets in the literature, IEMOCAP and MELD, the integration of acoustic and linguistic features reached a weighted accuracy (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.
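To make the fusion levels and the reported metric more concrete, the following is a minimal Python sketch, not the method of any study covered by the review. It assumes that feature-level fusion means concatenating the acoustic and linguistic feature vectors before classification, that decision-level fusion means a weighted average of the per-class probabilities produced by two separate classifiers, and that WA follows the common speech emotion recognition convention of overall accuracy (correct predictions divided by all predictions), as opposed to unweighted accuracy (mean per-class recall). All function names, the weight `w_acoustic`, and the toy labels are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def feature_level_fusion(acoustic: Sequence[float],
                         linguistic: Sequence[float]) -> List[float]:
    """Early fusion (assumed form): concatenate both feature vectors."""
    return list(acoustic) + list(linguistic)


def decision_level_fusion(p_acoustic: Sequence[float],
                          p_linguistic: Sequence[float],
                          w_acoustic: float = 0.5) -> List[float]:
    """Late fusion (assumed form): weighted average of class probabilities."""
    if len(p_acoustic) != len(p_linguistic):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= w_acoustic <= 1.0:
        raise ValueError("w_acoustic must lie in [0, 1]")
    w_linguistic = 1.0 - w_acoustic
    return [w_acoustic * a + w_linguistic * l
            for a, l in zip(p_acoustic, p_linguistic)]


def weighted_accuracy(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> float:
    """Overall accuracy; each class counts in proportion to its sample size."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true: Sequence[Hashable],
                        y_pred: Sequence[Hashable]) -> float:
    """Mean per-class recall; every class counts equally."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # Toy, made-up labels: three "angry" utterances and one "happy" one,
    # with a classifier that always predicts "angry".
    y_true = ["angry", "angry", "angry", "happy"]
    y_pred = ["angry", "angry", "angry", "angry"]
    print(weighted_accuracy(y_true, y_pred))    # 0.75
    print(unweighted_accuracy(y_true, y_pred))  # 0.5
```

The toy labels show why the two metrics can diverge on imbalanced data: a classifier that ignores the minority class still scores well on WA, so WA figures are easier to interpret when the class distribution of the dataset is known.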


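Source journal

Computer Speech and Language (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: (not shown)
Review time: 22.9 weeks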

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.