{"title":"Emotion recognition using multimodal matchmap fusion and multi-task learning","authors":"Ricardo Pizarro, Juan Bekios-Calfa","doi":"10.1049/icp.2021.1454","DOIUrl":null,"url":null,"abstract":"Emotion recognition is a complex task due to the great intraclass and inter-class variability that exists implicitly in the problem. From the point of view of the intra-class, an emotion can be expressed by different people, which generates different representations of it. For the inter-class case, there are some kinds of emotions that are alike. Traditionally, the problem has been approached in different ways, highlighting the analysis of images to determine the facial expression of a person to extrapolate it to a type of emotion, also, the use of audio sequences to estimate the emotion of the speaker. The present work seeks to solve this problem using multimodal techniques, multitask and Deep Learning. To help with these problems, the use of a fusion method based on the similarity between audio and video modalities will be investigated and applied to the emotion classification problem. The use of this method allows the use of auxiliary tasks that enhance the learned relationships between the emotions shown in video frames and audio frames belonging to the same emotion label and punish those that are different. The results show that when using the fusion method based on the similarity of modalities together with the use of multiple tasks, the classification is improved by 7% with respect to the classification obtained in the baseline model that uses concatenation of the characteristics of each modality, the experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database.","PeriodicalId":431144,"journal":{"name":"11th International Conference of Pattern Recognition Systems (ICPRS 2021)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"11th International Conference of Pattern Recognition Systems (ICPRS 2021)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1049/icp.2021.1454","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Emotion recognition is a complex task because of the large intra-class and inter-class variability inherent in the problem. From the intra-class point of view, the same emotion can be expressed by different people, which produces different representations of it. From the inter-class point of view, some emotions resemble one another. Traditionally, the problem has been approached in different ways, most notably by analysing images to determine a person's facial expression and extrapolate it to a type of emotion, or by using audio sequences to estimate the speaker's emotion. The present work addresses this problem using multimodal, multi-task, and deep learning techniques. To this end, a fusion method based on the similarity between the audio and video modalities is investigated and applied to the emotion classification problem. This method allows the use of auxiliary tasks that reinforce the learned relationships between emotions shown in video frames and audio frames belonging to the same emotion label, and penalise those that differ. The results show that combining the similarity-based fusion method with multiple tasks improves classification by 7% with respect to the baseline model, which concatenates the features of each modality. The experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database.
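To make the idea of matchmap-style similarity fusion with an auxiliary multi-task loss more concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the feature dimensions, the attention-style pooling over the matchmap, the module names (MatchmapFusionClassifier, multitask_loss), and the exact form of the auxiliary loss are all assumptions made for illustration; only the general scheme (a frame-by-frame audio-video similarity map used for fusion, plus an auxiliary term that rewards same-label cross-modal similarity and penalises different-label similarity) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchmapFusionClassifier(nn.Module):
    """Hypothetical matchmap fusion model: fuses per-frame audio and video
    features via their pairwise similarities, then classifies the emotion."""

    def __init__(self, audio_dim=40, video_dim=512, embed_dim=128, num_emotions=4):
        super().__init__()
        # Project per-frame features (assumed pre-extracted) into a shared space.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_emotions)

    def forward(self, audio_frames, video_frames):
        # audio_frames: (B, Ta, audio_dim), video_frames: (B, Tv, video_dim)
        a = F.normalize(self.audio_proj(audio_frames), dim=-1)   # (B, Ta, D)
        v = F.normalize(self.video_proj(video_frames), dim=-1)   # (B, Tv, D)
        # Matchmap: similarity between every audio frame and every video frame.
        matchmap = torch.einsum("bad,bvd->bav", a, v)            # (B, Ta, Tv)
        # Fuse by attending over video frames with the matchmap, then mean-pool.
        weights = matchmap.softmax(dim=-1)
        fused = torch.einsum("bav,bvd->bad", weights, v).mean(dim=1)  # (B, D)
        logits = self.classifier(fused)
        # Clip-level embeddings reused by the auxiliary similarity task below.
        a_clip = F.normalize(a.mean(dim=1), dim=-1)              # (B, D)
        v_clip = F.normalize(v.mean(dim=1), dim=-1)              # (B, D)
        return logits, a_clip, v_clip

def multitask_loss(logits, a_clip, v_clip, labels, margin=0.5, aux_weight=0.1):
    """Main classification loss plus an assumed auxiliary cross-modal term."""
    cls_loss = F.cross_entropy(logits, labels)
    # Cross-batch audio/video similarities and same-label targets.
    sim = a_clip @ v_clip.t()                                    # (B, B)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
    # Reward similarity for same-emotion pairs, penalise it (past a margin)
    # for pairs with different emotion labels.
    aux_loss = (same * (1.0 - sim) + (1.0 - same) * F.relu(sim - margin)).mean()
    return cls_loss + aux_weight * aux_loss
```

In this sketch the auxiliary term plays the role the abstract describes: audio and video embeddings from clips sharing an emotion label are pulled toward high similarity, while pairs with different labels are pushed below a margin, so the matchmap fusion is trained jointly with the classification objective.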