{"title":"UMEDNet: a multimodal approach for emotion detection in the Urdu language.","authors":"Adil Majeed, Hasan Mujtaba","doi":"10.7717/peerj-cs.2861","DOIUrl":null,"url":null,"abstract":"<p><p>Emotion detection is a critical component of interaction between human and computer systems, more especially affective computing, and health screening. Integrating video, speech, and text information provides better coverage of the basic and derived affective states with improved estimation of verbal and non-verbal behavior. However, there is a lack of systematic preferences and models for the detection of emotions in low-resource languages such as Urdu. To this effect, we propose Urdu Multimodal Emotion Detection Network (UMEDNet), a new emotion detection model for Urdu that works with video, speech, and text inputs for a better understanding of emotion. To support our proposed UMEDNet, we created the Urdu Multimodal Emotion Detection (UMED) <i>corpus</i>, which is a seventeen-hour annotated <i>corpus</i> of five basic emotions. To the best of our knowledge, the current study provides the first <i>corpus</i> for detecting emotion in the context of multimodal emotion detection for the Urdu language and is extensible for extended research. UMEDNet leverages state-of-the-art techniques for feature extraction across modalities; for extracting facial features from video, both Multi-task Cascaded Convolutional Networks (MTCNN) and FaceNet were used with fine-tuned Wav2Vec2 for speech features and XLM-Roberta for text. These features are then projected into common latent spaces to enable the effective fusion of multimodal data and to enhance the accuracy of emotion prediction. The model demonstrates strong performance, achieving an overall accuracy of 85.27%, while precision, recall, and F1 scores, are all approximately equivalent. In the end, we analyzed the impact of UMEDNet and found that our model integrates data on different modalities and leads to better performance.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2861"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192677/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2861","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Emotion detection is a critical component of human-computer interaction, especially in affective computing and health screening. Integrating video, speech, and text information provides better coverage of basic and derived affective states and improves the estimation of verbal and non-verbal behavior. However, there is a lack of systematic resources and models for detecting emotions in low-resource languages such as Urdu. To this end, we propose the Urdu Multimodal Emotion Detection Network (UMEDNet), a new emotion detection model for Urdu that works with video, speech, and text inputs for a better understanding of emotion. To support UMEDNet, we created the Urdu Multimodal Emotion Detection (UMED) corpus, a seventeen-hour corpus annotated with five basic emotions. To the best of our knowledge, this study provides the first corpus for multimodal emotion detection in the Urdu language, and it is extensible for further research. UMEDNet leverages state-of-the-art techniques for feature extraction across modalities: Multi-task Cascaded Convolutional Networks (MTCNN) and FaceNet extract facial features from video, a fine-tuned Wav2Vec2 extracts speech features, and XLM-RoBERTa extracts text features. These features are then projected into a common latent space to enable effective fusion of the multimodal data and to enhance the accuracy of emotion prediction. The model demonstrates strong performance, achieving an overall accuracy of 85.27%, with precision, recall, and F1 scores all approximately equal. Finally, we analyzed the impact of UMEDNet and found that integrating data from different modalities leads to better performance.
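To make the fusion step concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it projects per-modality features into a shared latent space and fuses them by concatenation for five-class emotion classification. The feature dimensions assume typical FaceNet (512), Wav2Vec2-base (768), and XLM-RoBERTa-base (768) embeddings; the latent size, concatenation-based fusion, and classifier head are illustrative assumptions, since the abstract does not specify these details.

```python
# Hedged sketch of multimodal latent-space fusion; all hyperparameters are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionFusion(nn.Module):
    def __init__(self, video_dim=512, speech_dim=768, text_dim=768,
                 latent_dim=256, num_emotions=5):
        super().__init__()
        # One projection per modality into a common latent space
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.speech_proj = nn.Linear(speech_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # Fuse the projected features by concatenation, then classify
        self.classifier = nn.Sequential(
            nn.Linear(3 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(latent_dim, num_emotions),
        )

    def forward(self, video_feat, speech_feat, text_feat):
        v = torch.relu(self.video_proj(video_feat))
        s = torch.relu(self.speech_proj(speech_feat))
        t = torch.relu(self.text_proj(text_feat))
        fused = torch.cat([v, s, t], dim=-1)
        return self.classifier(fused)  # logits over the five emotion classes

# Usage with dummy pre-extracted features for a batch of 4 utterances
model = MultimodalEmotionFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 5])
```

In this sketch the modality encoders (MTCNN+FaceNet, Wav2Vec2, XLM-RoBERTa) are assumed to run upstream and supply fixed-size feature vectors; only the projection-and-fusion head is shown.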
Journal description:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.