Title: Multimodal Arabic emotion recognition using deep learning
Authors: Noora Al Roken, Gerassimos Barlas
Journal: Speech Communication, Volume 155, Article 103005
Publication date: 2023-11-01
DOI: 10.1016/j.specom.2023.103005
URL: https://www.sciencedirect.com/science/article/pii/S0167639323001395
Citations: 0
Abstract
Emotion recognition has been an active research area for decades due to the complexity of the problem and its significance in human–computer interaction. Various methods have been employed to tackle this problem, leveraging different inputs such as speech, 2D and 3D images, audio signals, and text, all of which can convey emotional information. Recently, researchers have started combining multiple modalities to enhance the accuracy of emotion classification, recognizing that different emotions may be better expressed through different input types. This paper introduces a novel Arabic audio-visual natural-emotion dataset, investigates two existing multimodal classifiers, and proposes a new classifier trained on our Arabic dataset. Our evaluation encompasses different aspects, including variations in visual dataset sizes, joint and disjoint training, single and multimodal networks, as well as consecutive and overlapping segmentation. Through 5-fold cross-validation, our proposed classifier achieved exceptional results, with an average F1-score of 0.912 and an accuracy of 0.913 for natural emotion recognition.
Journal introduction:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.