Multimodal fusion in speech emotion recognition: A comprehensive review of methods and technologies

IF 8, Zone 2 (Computer Science), JCR Q1 (AUTOMATION & CONTROL SYSTEMS)

Engineering Applications of Artificial Intelligence, Pub Date: 2025-10-21, DOI: 10.1016/j.engappai.2025.112624

Nhut Minh Nguyen, Thanh Trung Nguyen, Phuong-Nam Tran, Chee Peng Lim, Nhat Truong Pham, Duc Ngoc Minh Dang

Volume 163, Article 112624. Full text: https://www.sciencedirect.com/science/article/pii/S0952197625026557

Citations: 0

Abstract



Speech emotion recognition (SER) plays a crucial role in human–computer interaction, enhancing numerous applications such as virtual assistants, healthcare monitoring, and customer support by identifying and interpreting emotions conveyed through spoken language. While unimodal SER systems demonstrate notable simplicity and computational efficiency, excelling in extracting critical features like vocal prosody and linguistic content, there is a pressing need to improve their performance in challenging conditions, such as noisy environments and the handling of ambiguous expressions or incomplete information. These challenges underscore the necessity of transitioning to multimodal approaches, which integrate complementary data sources to achieve more robust and accurate emotion detection. With advancements in artificial intelligence, especially in neural networks and deep learning, many studies have employed advanced deep learning and feature fusion techniques to enhance SER performance. This review synthesizes a comprehensive collection of publications from 2020 to 2024, exploring prominent multimodal fusion strategies, including early fusion, late fusion, deep fusion, and hybrid fusion methods, while also examining data representation, data translation, attention mechanisms, and graph-based fusion technologies. We assess the effectiveness of various fusion techniques across standard SER datasets, highlighting their performance in diverse tasks and addressing challenges related to data alignment, noise management, and computational demands. Furthermore, we highlight real-world applications of multimodal SER and provide critical research challenges that must be addressed for practical deployment, offering insights into optimal fusion strategies and guiding future developments in multimodal SER.
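The abstract names early and late fusion without defining them. The sketch below is a minimal illustration, assuming the usual reading of the terms: early fusion concatenates the per-modality feature vectors and classifies once, while late fusion classifies each modality separately and then combines the class probabilities. The function names, the linear scorers, and the toy numbers are all hypothetical and are not taken from the reviewed paper.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(features, weights):
    """One score per class: dot product of each weight row with the features."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def early_fusion(audio_feats, text_feats, joint_weights):
    """Feature-level fusion: concatenate modality features, classify once."""
    fused = list(audio_feats) + list(text_feats)
    return softmax(linear_scores(fused, joint_weights))


def late_fusion(audio_feats, text_feats, audio_weights, text_weights,
                audio_share=0.5):
    """Decision-level fusion: classify per modality, then mix probabilities."""
    p_audio = softmax(linear_scores(audio_feats, audio_weights))
    p_text = softmax(linear_scores(text_feats, text_weights))
    return [audio_share * a + (1.0 - audio_share) * t
            for a, t in zip(p_audio, p_text)]


# Toy inputs (hypothetical): 2 audio features, 2 text features, 2 classes.
AUDIO = [0.2, 0.9]
TEXT = [0.7, 0.1]
AUDIO_W = [[1.0, -1.0], [-1.0, 1.0]]
TEXT_W = [[0.5, 0.0], [0.0, 0.5]]
# Joint weights act on the concatenated 4-dimensional vector.
JOINT_W = [row_a + row_t for row_a, row_t in zip(AUDIO_W, TEXT_W)]

if __name__ == "__main__":
    print("early:", early_fusion(AUDIO, TEXT, JOINT_W))
    print("late: ", late_fusion(AUDIO, TEXT, AUDIO_W, TEXT_W))
```

In this sketch the early-fusion classifier can weigh audio and text features jointly, whereas the late-fusion variant keeps the two classifiers independent and exposes a single mixing weight (`audio_share`), which makes it easy to fall back on one modality when the other is noisy or missing.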


Source journal

Engineering Applications of Artificial Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.60

Self-citation rate: 10.00%

Publications: 505

Review time: 68 days

About the journal: Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.