Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN

2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA). Pub Date: 2023-02-14. DOI: 10.1109/IPRIA59240.2023.10147192

Masoumeh Sharafi, M. Yazdchi, J. Rasti



Abstract

Emotion recognition is a challenging task because of the emotional gap between subjective feelings and low-level audio-visual characteristics. A feasible approach to high-performance emotion recognition could therefore enhance human-computer interaction. Deep learning methods have improved the performance of emotion recognition systems relative to other current methods. In this paper, a multimodal model combining a deep convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network is proposed, which fuses audio and visual cues in a single deep model. The spatial and temporal features extracted from video frames are fused with short-time Fourier transform (STFT) features extracted from the audio signals. Finally, a Softmax classifier assigns each input to one of seven classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. The proposed model is evaluated on the Surrey Audio-Visual Expressed Emotion (SAVEE) database, where it reaches an accuracy of 95.48%. Our experiments indicate that the proposed method outperforms existing algorithms for emotion recognition on this dataset.
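The audio branch and the final classification step can be illustrated with a small NumPy sketch. This is not the authors' implementation: the frame length, hop size, Hann window, mean pooling, zero-valued visual feature vector, and random linear layer are all assumptions made for illustration, and the CNN and BiLSTM stages are omitted entirely. It only shows how an STFT magnitude representation might be computed from a waveform, concatenated with a visual feature vector, and passed through a softmax over the seven emotion classes named in the abstract.

```python
import numpy as np

# The seven classes listed in the abstract.
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]


def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude STFT: Hann-windowed frames followed by a real FFT per frame.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Result has shape (n_frames, frame_len // 2 + 1).
    return np.abs(np.fft.rfft(frames, axis=1))


def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


# Synthetic stand-in for an audio clip: a 1 s, 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

spec = stft_magnitude(audio)

# Hypothetical fusion by concatenation: a time-averaged audio descriptor
# joined with a placeholder visual feature vector (zeros here; in the paper
# the visual features come from the CNN and BiLSTM).
audio_feat = spec.mean(axis=0)
visual_feat = np.zeros(128)
fused = np.concatenate([audio_feat, visual_feat])

# Untrained random linear layer standing in for the learned classifier head.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(len(EMOTIONS), fused.size))
probs = softmax(weights @ fused)
predicted = EMOTIONS[int(np.argmax(probs))]
```

Because the weights are random, `predicted` carries no meaning here; the point is the data flow from waveform to spectrogram to fused vector to class probabilities.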


