Deep Neural Networks for Comprehensive Multimodal Emotion Recognition

Ashutosh Tiwari, Satyam Kumar, Tushar Mehrotra, Rajneesh Kumar Singh
2023 International Conference on Disruptive Technologies (ICDT), 11 May 2023. DOI: 10.1109/ICDT57929.2023.10150945

Abstract

Emotions can be expressed in many different ways, which makes automatic affect recognition challenging. Several domains stand to benefit from this technology, including audiovisual search and human-machine interfaces. Recently, neural networks have been used to assess emotional states with unprecedented accuracy. We present an approach to emotion recognition that exploits both visual and aural signals. Isolating relevant features is crucial for accurately representing the nuanced emotions conveyed across a wide range of speech patterns. To that end, we use a Convolutional Neural Network (CNN) to extract features from the audio track and a 50-layer deep ResNet to process the visual track. Beyond extracting these characteristics, the model should also be robust to outliers and sensitive to temporal context; to address this, we employ LSTM networks. We train the system from scratch on the RECOLA dataset from the AVEC 2016 emotion recognition challenge, and we show that our method outperforms prior approaches that relied on hand-crafted aural and visual cues for identifying spontaneous emotional states. The visual modality is shown to predict valence more accurately than arousal. The best results for the valence dimension on the RECOLA dataset are shown in Table III below.
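The architecture described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the audio branch is a small 1-D CNN over per-frame spectral features, the visual branch is a placeholder linear projection standing in for pooled 2048-dimensional ResNet-50 embeddings, and an LSTM models the fused sequence before regressing continuous valence and arousal per frame. All layer sizes are assumptions for the sketch.

```python
# Hedged sketch of a multimodal valence/arousal regressor: audio CNN +
# visual embedding (ResNet-50 in the paper; a linear stand-in here) + LSTM.
import torch
import torch.nn as nn


class MultimodalEmotionNet(nn.Module):
    def __init__(self, audio_feat=40, visual_feat=2048, hidden=128):
        super().__init__()
        # Audio branch: 1-D CNN over per-frame spectral features
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(audio_feat, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Visual branch: stands in for pooled ResNet-50 embeddings (2048-d)
        self.visual_proj = nn.Linear(visual_feat, 64)
        # Temporal model: LSTM over the fused per-frame features
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, batch_first=True)
        # Regress continuous valence and arousal for every frame
        self.head = nn.Linear(hidden, 2)

    def forward(self, audio, visual):
        # audio: (batch, time, audio_feat); visual: (batch, time, visual_feat)
        a = self.audio_cnn(audio.transpose(1, 2)).transpose(1, 2)  # (B, T, 64)
        v = self.visual_proj(visual)                               # (B, T, 64)
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))            # (B, T, hidden)
        return self.head(fused)                                    # (B, T, 2)


model = MultimodalEmotionNet()
out = model(torch.randn(2, 25, 40), torch.randn(2, 25, 2048))
print(tuple(out.shape))  # (2, 25, 2): per-frame valence and arousal
```

Late fusion via concatenation before the LSTM is one common design choice; the per-frame output matches RECOLA's continuous valence/arousal annotations.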