An Efficient Temporal Feature Aggregation of Audio-Video Signals for Human Emotion Recognition

Lovejit Singh, Sarbjeet Singh, N. Aggarwal, Ranjit Singh, Gagan Singla
{"title":"An Efficient Temporal Feature Aggregation of Audio-Video Signals for Human Emotion Recognition","authors":"Lovejit Singh, Sarbjeet Singh, N. Aggarwal, Ranjit Singh, Gagan Singla","doi":"10.1109/ISPCC53510.2021.9609528","DOIUrl":null,"url":null,"abstract":"Due to the significance of human behavioral intelligence in computing devices, this work focused on the facial expressions and speech of humans for their emotion recognition in multimodal (audio-video) signals. The audio-video signals consist of frames to represent the temporal activities of facial expressions and speech of humans. It become challenging to determine the efficient method to construct a spatial and temporal feature vector from the frame-wise spatial feature descriptor to describe the facial expressions and speech temporal information in audio-video signals. In this paper, an efficient temporal feature aggregation method is presented for human emotion recognition in audio-video signals. The Local Binary Pattern (LBP) feature of facial expressions and Mel Frequency Cepstral Coefficients (MFCCs) and its $\\Delta+\\Delta\\Delta$ of speech are computed from each frame. The experiment analysis is performed to decide the efficient method for temporal feature aggregation, i.e., sum normalization or statistical functions, to construct a spatial and temporal feature vector. The multiclass Support Vector Machine (SVM) classification model is trained and tested to evaluate the performance of temporal feature aggregation method with LBP features and MFCCs and its $\\Delta+\\Delta\\Delta$ features. The Bayesian optimization (BO) method determines the optimal hyper-parameters of the multiclass SVM classifier for emotion detection. 
The experiment analysis of proposed work is performed on publicly accessible and challenging Crowd-sourced Emotional Multimodal Actors-Dataset (CREMA-D) and compared with existing work.","PeriodicalId":113266,"journal":{"name":"2021 6th International Conference on Signal Processing, Computing and Control (ISPCC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 6th International Conference on Signal Processing, Computing and Control (ISPCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPCC53510.2021.9609528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Given the significance of human behavioral intelligence in computing devices, this work focuses on human facial expressions and speech for emotion recognition in multimodal (audio-video) signals. Audio-video signals consist of frames that represent the temporal activity of human facial expressions and speech. It is challenging to determine an efficient method for constructing a spatial and temporal feature vector from frame-wise spatial feature descriptors that describes the facial-expression and speech temporal information in audio-video signals. In this paper, an efficient temporal feature aggregation method is presented for human emotion recognition in audio-video signals. The Local Binary Pattern (LBP) features of facial expressions, and the Mel Frequency Cepstral Coefficients (MFCCs) with their $\Delta+\Delta\Delta$ of speech, are computed from each frame. Experimental analysis is performed to decide which temporal feature aggregation method, i.e., sum normalization or statistical functions, constructs the more effective spatial and temporal feature vector. A multiclass Support Vector Machine (SVM) classification model is trained and tested to evaluate the performance of the temporal feature aggregation methods with the LBP features and the MFCCs with their $\Delta+\Delta\Delta$ features. The Bayesian optimization (BO) method determines the optimal hyper-parameters of the multiclass SVM classifier for emotion detection. The experimental analysis of the proposed work is performed on the publicly accessible and challenging Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) and compared with existing work.
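The two aggregation schemes the abstract compares can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes frame-wise features arrive as an `(n_frames, dim)` matrix (e.g., 39-dim MFCC+$\Delta$+$\Delta\Delta$ vectors per speech frame), and the exact statistical functions and normalization used in the paper may differ.

```python
import numpy as np

def aggregate_stats(frame_features: np.ndarray) -> np.ndarray:
    """Statistical-function aggregation: collapse the time axis of an
    (n_frames, dim) matrix into one fixed-length vector by concatenating
    per-dimension mean and standard deviation (a common choice; the
    paper's exact set of statistics is an assumption here)."""
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.std(axis=0)])

def aggregate_sum_norm(frame_features: np.ndarray) -> np.ndarray:
    """Sum-normalization aggregation: sum frame-wise features over time,
    then L1-normalize so the result is independent of clip length."""
    s = frame_features.sum(axis=0)
    n = np.abs(s).sum()
    return s / n if n > 0 else s

# Hypothetical clip: 120 frames of 39-dim MFCC+delta+delta-delta features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 39))

v_stats = aggregate_stats(frames)      # 78-dim vector (mean || std)
v_sum = aggregate_sum_norm(frames)     # 39-dim, L1-normalized vector
```

Either fixed-length vector (optionally concatenated with the aggregated LBP descriptor of the video frames) can then be fed to a multiclass SVM, whose hyper-parameters the paper tunes with Bayesian optimization.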