{"title":"Emotion recognition using multimodal matchmap fusion and multi-task learning","authors":"Ricardo Pizarro, Juan Bekios-Calfa","doi":"10.1049/icp.2021.1454","DOIUrl":null,"url":null,"abstract":"Emotion recognition is a complex task due to the great intraclass and inter-class variability that exists implicitly in the problem. From the point of view of the intra-class, an emotion can be expressed by different people, which generates different representations of it. For the inter-class case, there are some kinds of emotions that are alike. Traditionally, the problem has been approached in different ways, highlighting the analysis of images to determine the facial expression of a person to extrapolate it to a type of emotion, also, the use of audio sequences to estimate the emotion of the speaker. The present work seeks to solve this problem using multimodal techniques, multitask and Deep Learning. To help with these problems, the use of a fusion method based on the similarity between audio and video modalities will be investigated and applied to the emotion classification problem. The use of this method allows the use of auxiliary tasks that enhance the learned relationships between the emotions shown in video frames and audio frames belonging to the same emotion label and punish those that are different. The results show that when using the fusion method based on the similarity of modalities together with the use of multiple tasks, the classification is improved by 7% with respect to the classification obtained in the baseline model that uses concatenation of the characteristics of each modality, the experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database.","PeriodicalId":431144,"journal":{"name":"11th International Conference of Pattern Recognition Systems (ICPRS 2021)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"11th International Conference of Pattern Recognition Systems (ICPRS 2021)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1049/icp.2021.1454","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Emotion recognition is a complex task because of the large intra-class and inter-class variability inherent in the problem. From the intra-class point of view, the same emotion can be expressed by different people, which produces different representations of it. From the inter-class point of view, some emotions resemble one another. Traditionally, the problem has been approached in different ways, most notably by analysing images to determine a person's facial expression and extrapolate it to a type of emotion, or by using audio sequences to estimate the speaker's emotion. The present work addresses this problem using multimodal, multi-task, and deep learning techniques. To this end, a fusion method based on the similarity between the audio and video modalities is investigated and applied to the emotion classification problem. This method allows the use of auxiliary tasks that reinforce the learned relationships between emotions shown in video frames and audio frames belonging to the same emotion label, and penalise those that differ. The results show that combining the similarity-based fusion method with multiple tasks improves classification by 7% with respect to the baseline model, which concatenates the features of each modality. The experiments are performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database.
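To make the idea of matchmap-style similarity fusion with an auxiliary multi-task loss more concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the feature dimensions, the attention-style pooling over the matchmap, the module names (MatchmapFusionClassifier, multitask_loss), and the exact form of the auxiliary loss are all assumptions made for illustration; only the general scheme (a frame-by-frame audio-video similarity map used for fusion, plus an auxiliary term that rewards same-label cross-modal similarity and penalises different-label similarity) follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchmapFusionClassifier(nn.Module):
    """Hypothetical matchmap fusion model: fuses per-frame audio and video
    features via their pairwise similarities, then classifies the emotion."""

    def __init__(self, audio_dim=40, video_dim=512, embed_dim=128, num_emotions=4):
        super().__init__()
        # Project per-frame features (assumed pre-extracted) into a shared space.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_emotions)

    def forward(self, audio_frames, video_frames):
        # audio_frames: (B, Ta, audio_dim), video_frames: (B, Tv, video_dim)
        a = F.normalize(self.audio_proj(audio_frames), dim=-1)   # (B, Ta, D)
        v = F.normalize(self.video_proj(video_frames), dim=-1)   # (B, Tv, D)
        # Matchmap: similarity between every audio frame and every video frame.
        matchmap = torch.einsum("bad,bvd->bav", a, v)            # (B, Ta, Tv)
        # Fuse by attending over video frames with the matchmap, then mean-pool.
        weights = matchmap.softmax(dim=-1)
        fused = torch.einsum("bav,bvd->bad", weights, v).mean(dim=1)  # (B, D)
        logits = self.classifier(fused)
        # Clip-level embeddings reused by the auxiliary similarity task below.
        a_clip = F.normalize(a.mean(dim=1), dim=-1)              # (B, D)
        v_clip = F.normalize(v.mean(dim=1), dim=-1)              # (B, D)
        return logits, a_clip, v_clip

def multitask_loss(logits, a_clip, v_clip, labels, margin=0.5, aux_weight=0.1):
    """Main classification loss plus an assumed auxiliary cross-modal term."""
    cls_loss = F.cross_entropy(logits, labels)
    # Cross-batch audio/video similarities and same-label targets.
    sim = a_clip @ v_clip.t()                                    # (B, B)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
    # Reward similarity for same-emotion pairs, penalise it (past a margin)
    # for pairs with different emotion labels.
    aux_loss = (same * (1.0 - sim) + (1.0 - same) * F.relu(sim - margin)).mean()
    return cls_loss + aux_weight * aux_loss
```

In this sketch the auxiliary term plays the role the abstract describes: audio and video embeddings from clips sharing an emotion label are pulled toward high similarity, while pairs with different labels are pushed below a margin, so the matchmap fusion is trained jointly with the classification objective.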