{"title":"基于跨模态变换器的交互式协同学习在视听情感识别中的应用","authors":"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi","doi":"10.21437/interspeech.2022-11307","DOIUrl":null,"url":null,"abstract":"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4740-4744"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition\",\"authors\":\"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi\",\"doi\":\"10.21437/interspeech.2022-11307\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. 
Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"4740-4744\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-11307\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-11307","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a promising approach. In conventional multi-modal modeling methods, a recognition model is trained on an audio-visual paired dataset solely to enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, which indicates that it is degraded by overfitting to combinations of the individual modal features. Our supposition is that the ideal form of emotion recognition is to accurately perform both audio-visual multi-modal processing and single-modal processing with a single model. This is expected to promote the utilization of individual modal knowledge and thereby improve audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that can handle different types of inputs. In addition, we introduce a novel training method named interactive co-learning, which allows the model to learn from both modalities jointly and from each modality individually. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
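To make the described approach concrete, the following is a minimal, hypothetical PyTorch sketch of the two ideas summarized in the abstract: a single cross-modal transformer that accepts audio features, visual features, or both, and an interactive co-learning-style training step that alternates between joint and single-modal input conditions. All names, feature dimensions, and the random alternation schedule are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' implementation) of a cross-modal transformer
# trained with an interactive co-learning-style schedule: each batch is presented
# with both modalities or with one modality withheld, so a single model learns
# audio-visual and single-modal emotion recognition.
import random
import torch
import torch.nn as nn

class CrossModalEmotionModel(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, d_model=256, num_labels=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Learned modality-type embeddings let one encoder handle either or both inputs.
        self.modality_embed = nn.Embedding(2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.classifier = nn.Linear(d_model, num_labels)  # multi-label logits

    def forward(self, audio=None, visual=None):
        tokens = []
        if audio is not None:  # audio frames: (batch, T_a, audio_dim)
            tokens.append(self.audio_proj(audio) + self.modality_embed.weight[0])
        if visual is not None:  # visual frames: (batch, T_v, visual_dim)
            tokens.append(self.visual_proj(visual) + self.modality_embed.weight[1])
        x = self.encoder(torch.cat(tokens, dim=1))
        return self.classifier(x.mean(dim=1))  # mean-pooled multi-label logits

model = CrossModalEmotionModel()
criterion = nn.BCEWithLogitsLoss()  # multi-label emotion targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(audio, visual, labels):
    # Randomly choose the input condition: both modalities, audio only, or visual only.
    condition = random.choice(["both", "audio", "visual"])
    logits = model(
        audio=audio if condition in ("both", "audio") else None,
        visual=visual if condition in ("both", "visual") else None,
    )
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with dummy data: 2 utterances, 50 audio frames, 30 video frames, 8 emotion labels.
audio = torch.randn(2, 50, 40)
visual = torch.randn(2, 30, 512)
labels = torch.randint(0, 2, (2, 8)).float()
print(training_step(audio, visual, labels))
```

The key design choice illustrated here is that single-modal and multi-modal inference share one set of encoder parameters, so knowledge learned from single-modal batches can transfer to the audio-visual case, which is the motivation the abstract gives for interactive co-learning.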