{"title":"单模态和多模态情感检测的综合综述:数据集、方法和局限性","authors":"Priyanka Thakur, Nirmal Kaur, Naveen Aggarwal, Sarbjeet Singh","doi":"10.1111/exsy.70103","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Emotion detection from face and speech is inherent for human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability in speech features present challenges in capturing complex emotional states. Accordingly, deep learning models have been substantial in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. The proposed article presents a comprehensive review of recent progress in emotion detection, spanning from unimodal to multimodal systems, with a focus on facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review aims to provide an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in judicious selection of the most appropriate dataset and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when utilising deep learning methods trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding in the creation of more intuitive and empathetic technologies.</p>\n </div>","PeriodicalId":51053,"journal":{"name":"Expert Systems","volume":"42 9","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations\",\"authors\":\"Priyanka Thakur, Nirmal Kaur, Naveen Aggarwal, Sarbjeet Singh\",\"doi\":\"10.1111/exsy.70103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Emotion detection from face and speech is inherent for human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability in speech features present challenges in capturing complex emotional states. Accordingly, deep learning models have been substantial in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. The proposed article presents a comprehensive review of recent progress in emotion detection, spanning from unimodal to multimodal systems, with a focus on facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. 
The review aims to provide an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in judicious selection of the most appropriate dataset and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when utilising deep learning methods trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding in the creation of more intuitive and empathetic technologies.</p>\\n </div>\",\"PeriodicalId\":51053,\"journal\":{\"name\":\"Expert Systems\",\"volume\":\"42 9\",\"pages\":\"\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2025-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/exsy.70103\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/exsy.70103","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations
Emotion detection from face and speech is integral to human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability of speech features make complex emotional states difficult to capture. Accordingly, deep learning models have proven instrumental in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. This article presents a comprehensive review of recent progress in emotion detection, spanning unimodal to multimodal systems, with a focus on the facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review provides an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and best-performing models. Such analysis helps researchers judiciously select the most appropriate datasets and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when deep learning methods are trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, aiding the creation of more intuitive and empathetic technologies.
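The benefit the abstract attributes to fusing synchronised audio and video can be illustrated with a simple late-fusion architecture. The PyTorch sketch below is not a model from the reviewed literature; the branch sizes, feature dimensions, and seven-class emotion set are illustrative assumptions, showing only how per-modality encoders can be combined before a shared classifier.

```python
# A minimal late-fusion sketch (illustrative, not the reviewed authors' model):
# two unimodal encoders whose outputs are concatenated and classified jointly.
# All layer sizes and the 7-class emotion set are assumptions for this example.
import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    def __init__(self, face_dim=512, audio_dim=128, num_emotions=7):
        super().__init__()
        # Unimodal branches: map each modality to a shared embedding size.
        self.face_branch = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU())
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        # Fusion head: concatenate the two embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, num_emotions)
        )

    def forward(self, face_feats, audio_feats):
        fused = torch.cat(
            [self.face_branch(face_feats), self.audio_branch(audio_feats)], dim=-1
        )
        return self.classifier(fused)  # raw logits, one per emotion class

# Usage: a batch of 4 synchronised face/audio feature vectors.
model = LateFusionEmotionNet()
logits = model(torch.randn(4, 512), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 7])
```

Late fusion of this kind keeps each modality's encoder independent, so a missing or noisy modality degrades rather than breaks the pipeline; the review also covers earlier-stage and transformer-based fusion strategies.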
Journal overview:
Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper.
As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.