{"title":"UMEDNet: a multimodal approach for emotion detection in the Urdu language.","authors":"Adil Majeed, Hasan Mujtaba","doi":"10.7717/peerj-cs.2861","DOIUrl":null,"url":null,"abstract":"<p><p>Emotion detection is a critical component of interaction between human and computer systems, more especially affective computing, and health screening. Integrating video, speech, and text information provides better coverage of the basic and derived affective states with improved estimation of verbal and non-verbal behavior. However, there is a lack of systematic preferences and models for the detection of emotions in low-resource languages such as Urdu. To this effect, we propose Urdu Multimodal Emotion Detection Network (UMEDNet), a new emotion detection model for Urdu that works with video, speech, and text inputs for a better understanding of emotion. To support our proposed UMEDNet, we created the Urdu Multimodal Emotion Detection (UMED) <i>corpus</i>, which is a seventeen-hour annotated <i>corpus</i> of five basic emotions. To the best of our knowledge, the current study provides the first <i>corpus</i> for detecting emotion in the context of multimodal emotion detection for the Urdu language and is extensible for extended research. UMEDNet leverages state-of-the-art techniques for feature extraction across modalities; for extracting facial features from video, both Multi-task Cascaded Convolutional Networks (MTCNN) and FaceNet were used with fine-tuned Wav2Vec2 for speech features and XLM-Roberta for text. These features are then projected into common latent spaces to enable the effective fusion of multimodal data and to enhance the accuracy of emotion prediction. The model demonstrates strong performance, achieving an overall accuracy of 85.27%, while precision, recall, and F1 scores, are all approximately equivalent. In the end, we analyzed the impact of UMEDNet and found that our model integrates data on different modalities and leads to better performance.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2861"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192677/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2861","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Emotion detection is a critical component of human-computer interaction, especially in affective computing and health screening. Integrating video, speech, and text information provides better coverage of basic and derived affective states and improves the estimation of verbal and non-verbal behavior. However, there is a lack of systematic resources and models for detecting emotions in low-resource languages such as Urdu. To this end, we propose the Urdu Multimodal Emotion Detection Network (UMEDNet), a new emotion detection model for Urdu that works with video, speech, and text inputs for a better understanding of emotion. To support UMEDNet, we created the Urdu Multimodal Emotion Detection (UMED) corpus, a seventeen-hour corpus annotated with five basic emotions. To the best of our knowledge, this study provides the first corpus for multimodal emotion detection in the Urdu language, and it is extensible for further research. UMEDNet leverages state-of-the-art techniques for feature extraction across modalities: Multi-task Cascaded Convolutional Networks (MTCNN) and FaceNet extract facial features from video, a fine-tuned Wav2Vec2 extracts speech features, and XLM-RoBERTa extracts text features. These features are then projected into a common latent space to enable effective fusion of the multimodal data and to enhance the accuracy of emotion prediction. The model demonstrates strong performance, achieving an overall accuracy of 85.27%, with precision, recall, and F1 scores all approximately equal. Finally, we analyzed the impact of UMEDNet and found that integrating data from different modalities leads to better performance.
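To make the fusion step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it projects per-modality features into a shared latent space and fuses them by concatenation for five-class emotion classification. The feature dimensions assume typical FaceNet (512), Wav2Vec2-base (768), and XLM-RoBERTa-base (768) embeddings; the latent size, concatenation-based fusion, and classifier head are illustrative assumptions, since the abstract does not specify these details.

```python
# Hedged sketch of multimodal latent-space fusion; all hyperparameters are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionFusion(nn.Module):
    def __init__(self, video_dim=512, speech_dim=768, text_dim=768,
                 latent_dim=256, num_emotions=5):
        super().__init__()
        # One projection per modality into a common latent space
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.speech_proj = nn.Linear(speech_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # Fuse the projected features by concatenation, then classify
        self.classifier = nn.Sequential(
            nn.Linear(3 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(latent_dim, num_emotions),
        )

    def forward(self, video_feat, speech_feat, text_feat):
        v = torch.relu(self.video_proj(video_feat))
        s = torch.relu(self.speech_proj(speech_feat))
        t = torch.relu(self.text_proj(text_feat))
        fused = torch.cat([v, s, t], dim=-1)
        return self.classifier(fused)  # logits over the five emotion classes

# Usage with dummy pre-extracted features for a batch of 4 utterances
model = MultimodalEmotionFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 5])
```

In this sketch the modality encoders (MTCNN+FaceNet, Wav2Vec2, XLM-RoBERTa) are assumed to run upstream and supply fixed-size feature vectors; only the projection-and-fusion head is shown.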
Journal description:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.