{"title":"单模态和多模态情感检测的综合综述:数据集、方法和局限性","authors":"Priyanka Thakur, Nirmal Kaur, Naveen Aggarwal, Sarbjeet Singh","doi":"10.1111/exsy.70103","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>Emotion detection from face and speech is inherent for human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability in speech features present challenges in capturing complex emotional states. Accordingly, deep learning models have been substantial in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. The proposed article presents a comprehensive review of recent progress in emotion detection, spanning from unimodal to multimodal systems, with a focus on facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review aims to provide an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in judicious selection of the most appropriate dataset and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when utilising deep learning methods trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding in the creation of more intuitive and empathetic technologies.</p>\n </div>","PeriodicalId":51053,"journal":{"name":"Expert Systems","volume":"42 9","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations\",\"authors\":\"Priyanka Thakur, Nirmal Kaur, Naveen Aggarwal, Sarbjeet Singh\",\"doi\":\"10.1111/exsy.70103\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>Emotion detection from face and speech is inherent for human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability in speech features present challenges in capturing complex emotional states. Accordingly, deep learning models have been substantial in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. The proposed article presents a comprehensive review of recent progress in emotion detection, spanning from unimodal to multimodal systems, with a focus on facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. 
The review aims to provide an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and the best-performing models. Such analysis aids researchers in judicious selection of the most appropriate dataset and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when utilising deep learning methods trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, thereby aiding in the creation of more intuitive and empathetic technologies.</p>\\n </div>\",\"PeriodicalId\":51053,\"journal\":{\"name\":\"Expert Systems\",\"volume\":\"42 9\",\"pages\":\"\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2025-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/exsy.70103\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/exsy.70103","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Comprehensive Review of Unimodal and Multimodal Emotion Detection: Datasets, Approaches, and Limitations
Emotion detection from face and speech is integral to human–computer interaction, mental health assessment, social robotics, and emotional intelligence. Traditional machine learning methods typically depend on handcrafted features and are primarily centred on unimodal systems. However, the unique characteristics of facial expressions and the variability of speech features make complex emotional states difficult to capture. Accordingly, deep learning models have proven instrumental in automatically extracting intrinsic emotional features with greater accuracy across multiple modalities. This article presents a comprehensive review of recent progress in emotion detection, spanning unimodal to multimodal systems, with a focus on the facial and speech modalities. It examines state-of-the-art machine learning, deep learning, and the latest transformer-based approaches for emotion detection. The review provides an in-depth analysis of both unimodal and multimodal emotion detection techniques, highlighting their limitations, popular datasets, challenges, and best-performing models. Such analysis helps researchers judiciously select the most appropriate datasets and audio-visual emotion detection models. Key findings suggest that integrating multimodal data significantly improves emotion recognition, particularly when deep learning methods are trained on synchronised audio and video datasets. By assessing recent advancements and current challenges, this article serves as a fundamental resource for researchers and practitioners in the field of emotional AI, aiding the creation of more intuitive and empathetic technologies.
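The benefit the abstract attributes to fusing synchronised audio and video can be illustrated with a simple late-fusion architecture. The PyTorch sketch below is not a model from the reviewed literature; the branch sizes, feature dimensions, and seven-class emotion set are illustrative assumptions, showing only how per-modality encoders can be combined before a shared classifier.

```python
# A minimal late-fusion sketch (illustrative, not the reviewed authors' model):
# two unimodal encoders whose outputs are concatenated and classified jointly.
# All layer sizes and the 7-class emotion set are assumptions for this example.
import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    def __init__(self, face_dim=512, audio_dim=128, num_emotions=7):
        super().__init__()
        # Unimodal branches: map each modality to a shared embedding size.
        self.face_branch = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU())
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        # Fusion head: concatenate the two embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, num_emotions)
        )

    def forward(self, face_feats, audio_feats):
        fused = torch.cat(
            [self.face_branch(face_feats), self.audio_branch(audio_feats)], dim=-1
        )
        return self.classifier(fused)  # raw logits, one per emotion class

# Usage: a batch of 4 synchronised face/audio feature vectors.
model = LateFusionEmotionNet()
logits = model(torch.randn(4, 512), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 7])
```

Late fusion of this kind keeps each modality's encoder independent, so a missing or noisy modality degrades rather than breaks the pipeline; the review also covers earlier-stage and transformer-based fusion strategies.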
Journal overview:
Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper.
As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.