Deep learning model for simultaneous recognition of quantitative and qualitative emotion using visual and bio-sensing data

IF 4.3; Zone 3 (Computer Science); JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding. Pub Date: 2024-08-22. DOI: 10.1016/j.cviu.2024.104121

Iman Hosseini, Md Zakir Hossain, Yuhao Zhang, Shafin Rahman

Volume 248, Article 104121. Full text: https://www.sciencedirect.com/science/article/pii/S1077314224002029

Citations: 0

Abstract

Emotion recognition relies heavily on human facial expressions and on physiological signals such as the electroencephalogram and electrocardiogram. In the literature, emotion recognition is studied quantitatively (estimating valence, arousal, and dominance) and qualitatively (predicting discrete emotions such as happiness, sadness, anger, and surprise). Current methods combine visual data and bio-sensing information to build multi-modal recognition systems for quantitative or qualitative emotion. However, these methods require extensive domain expertise and intricate preprocessing, so they cannot fully exploit the advantages of end-to-end deep learning. Moreover, they usually aim to recognize either qualitative or quantitative emotions. Although the two kinds of emotion are strongly correlated, previous methods do not recognize them simultaneously. This paper introduces DeepVADNet, a novel deep end-to-end framework for multi-modal emotion recognition. The framework extracts face appearance features and bio-sensing features and predicts both qualitative and quantitative emotions in a single forward pass. We employ a CRNN architecture to extract face appearance features and a ConvLSTM model to extract spatio-temporal information from visual data (videos). We process physiological signals (EEG, EOG, ECG, and GSR) with a Conv1D model, in contrast to conventional approaches that rely on manually extracting time-domain and frequency-domain features. After enhancing feature quality by fusing both modalities, we use a novel method that employs the quantitative emotion to predict qualitative emotions accurately. We perform extensive experiments on the DEAP and MAHNOB-HCI datasets, achieving state-of-the-art quantitative emotion recognition results of 98.93%/6e-4 and 89.08%/0.97 (mean classification accuracy/MSE) on the two datasets, respectively. For the qualitative emotion recognition task, we achieve 82.71% mean classification accuracy on the MAHNOB-HCI dataset. The code and evaluation can be accessed at: https://github.com/I-Man-H/DeepVADNet.git
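The data flow described in the abstract (learned 1-D convolutions over physiological channels, fusion with face features, a quantitative valence/arousal/dominance output that then feeds the qualitative prediction) can be illustrated with a toy sketch. This is not the authors' implementation, which is in the linked repository. Everything below is an assumption made for illustration: the function names (`conv1d`, `init_params`, `forward`), fusion by simple concatenation, the sigmoid that squashes VAD values into (0, 1), the layer sizes, and the random untrained weights. The face branch (CRNN/ConvLSTM) is replaced by a placeholder feature vector.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation of one signal channel with one kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_face, n_bio, kernel_size, n_emotions, seed=0):
    """Random, untrained weights (hypothetical sizes, for illustration only)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    n_fused = n_face + n_bio
    return {
        "kernels": rand_matrix(n_bio, kernel_size),      # one kernel per bio channel
        "w_vad": rand_matrix(3, n_fused),                # valence, arousal, dominance
        "b_vad": [0.0] * 3,
        "w_emo": rand_matrix(n_emotions, n_fused + 3),   # fused features + VAD
        "b_emo": [0.0] * n_emotions,
    }


def forward(face_features, bio_channels, params):
    """One forward pass returning (vad, emotion_probabilities)."""
    # Bio branch: convolution -> ReLU -> global average pooling per channel.
    bio_features = []
    for channel, kernel in zip(bio_channels, params["kernels"]):
        activations = [max(0.0, a) for a in conv1d(channel, kernel)]
        bio_features.append(sum(activations) / len(activations))

    # Fusion by concatenation (an assumption; the abstract only says "fusing").
    fused = list(face_features) + bio_features

    # Quantitative head: three continuous values squashed into (0, 1).
    vad = [sigmoid(z) for z in linear(fused, params["w_vad"], params["b_vad"])]

    # Qualitative head: sees the fused features and the predicted VAD values.
    probs = softmax(linear(fused + vad, params["w_emo"], params["b_emo"]))
    return vad, probs


if __name__ == "__main__":
    face = [0.1, 0.2, 0.3, 0.4]  # placeholder for CRNN/ConvLSTM output
    bio = [[math.sin(0.1 * t) for t in range(64)],
           [math.cos(0.2 * t) for t in range(64)]]
    params = init_params(n_face=4, n_bio=2, kernel_size=5, n_emotions=4)
    vad, probs = forward(face, bio, params)
    print("VAD:", [round(v, 3) for v in vad])
    print("emotion probabilities:", [round(p, 3) for p in probs])
```

The point of the sketch is the ordering inside `forward`: the quantitative output is computed first and then concatenated into the input of the qualitative head, so both predictions come out of the same pass. With trained weights, that coupling is one plausible way to exploit the correlation between the two kinds of emotion that the abstract mentions.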

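Source journal

Computer Vision and Image Understanding (Engineering & Technology: Electrical and Electronic Engineering)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days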

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems