Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

IEEE Transactions on Artificial Intelligence. Pub Date: 2024-08-19. DOI: 10.1109/TAI.2024.3445325

Tao Meng, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li

Volume 5, Issue 12, pp. 6472-6487. IEEE Xplore: https://ieeexplore.ieee.org/document/10639357/


Abstract

Multimodal emotion recognition in conversations (MERC) aims to identify the emotions expressed across modalities such as text, audio, image, and video, and is an important direction for realizing machine intelligence. However, MERC data often exhibit a naturally imbalanced distribution of emotion categories, and researchers have largely overlooked the negative impact of this imbalance on emotion recognition. To tackle this problem, we systematically analyze it from three aspects: data augmentation, loss sensitivity, and sampling strategy, and propose the class boundary enhanced representation learning (CBERL) model. Concretely, we first design a multimodal generative adversarial network to address the imbalanced distribution of emotion categories in the raw data. Second, we propose a deep joint variational autoencoder to fuse complementary semantic information across modalities and obtain discriminative feature representations. Finally, we implement a multitask graph neural network with mask reconstruction and classification optimization to address overfitting and underfitting in class boundary learning and to achieve cross-modal emotion recognition. We conduct extensive experiments on the interactive emotional dyadic motion capture (IEMOCAP) and multimodal emotion lines dataset (MELD) benchmarks, and the results show that CBERL improves emotion recognition performance. In particular, on the minority-class emotion labels "fear" and "disgust", our model improves accuracy and F1 by 10% to 20%. Our code is publicly available at https://github.com/yuntaoshou/CBERL.
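The abstract reports gains in accuracy and F1 on minority classes. As a rough illustration of why per-class F1 matters under class imbalance, the sketch below computes per-class and macro F1 in plain Python. It is not taken from the CBERL code: the function name `per_class_f1` and the toy labels are assumptions made up for this example, and the labels are not drawn from IEMOCAP or MELD.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


# Toy labels invented for illustration only (hypothetical, not benchmark data).
y_true = ["neutral"] * 8 + ["fear"] * 2
y_pred = ["neutral"] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}, per-class F1={f1}")
```

In this toy case the majority-only classifier still reaches high overall accuracy while its F1 on the minority class is zero, which is the kind of gap that per-class reporting on labels such as "fear" and "disgust" is meant to expose.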


Source journal

IEEE Transactions on Artificial Intelligence

CiteScore

7.70

Self-citation rate

0.00%

Publication count