Multimodal Decoupled Distillation Graph Neural Network for Emotion Recognition in Conversation
Yijing Dai; Yingjian Li; Dongpeng Chen; Jinxing Li; Guangming Lu
IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, pp. 9910-9924, 2024. DOI: 10.1109/TCSVT.2024.3405406. https://ieeexplore.ieee.org/document/10539116/
Citations: 0
Abstract
Graph Neural Networks (GNNs) have attracted increasing attention for multimodal Emotion Recognition in Conversation (ERC) due to their strong performance in contextual understanding. However, most existing GNN-based methods face two challenges: 1) how to explore and propagate appropriate information in a conversational graph. Typical GNNs in ERC neglect to mine the emotion commonality and discrepancy in the local neighborhood, leading them to learn similar embeddings for connected nodes. However, the embeddings of these connected nodes should be distinguishable, as they belong to different speakers with different emotions. 2) Most existing works apply simple concatenation or co-occurrence priors for modality combination, failing to fully capture the emotional information of multiple modalities in relationship modeling. In this paper, we propose a multimodal Decoupled Distillation Graph Neural Network (D2GNN) to address these challenges. Specifically, D2GNN decouples the input features into emotion-aware and emotion-agnostic ones at the emotion category level, aiming to capture emotion commonality and implicit emotion information, respectively. Moreover, we design a new message passing mechanism that separately propagates emotion-aware and emotion-agnostic knowledge between nodes according to speaker dependency in two GNN-based modules, exploring the correlations of utterances and alleviating the similarity of embeddings. Furthermore, a multimodal distillation unit aggregates the unimodal decoupled features to obtain distinguishable embeddings. Experimental results on two ERC benchmarks demonstrate the superiority of the proposed model. Code is available at https://github.com/gityider/D2GNN.
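To make the pipeline concrete, the minimal PyTorch sketch below illustrates the three stages the abstract names: decoupling each utterance feature into emotion-aware and emotion-agnostic parts, propagating each stream separately with speaker-dependent message passing, and fusing the results. Everything here is a reading of the abstract rather than the released code: the projection-based decoupler, the same-/cross-speaker edge split, and the additive fusion are all assumptions; the authors' actual implementation is in the linked repository.

```python
# Illustrative sketch only: class names, the projection-based decoupling, and
# the degree-normalized propagation are assumptions made for exposition, not
# the authors' implementation (see https://github.com/gityider/D2GNN).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionDecoupler(nn.Module):
    """Split each utterance feature into an emotion-aware part (emotion
    commonality) and an emotion-agnostic part (implicit content)."""

    def __init__(self, dim: int):
        super().__init__()
        self.aware = nn.Linear(dim, dim)
        self.agnostic = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        return torch.tanh(self.aware(x)), torch.tanh(self.agnostic(x))


class SpeakerAwarePassing(nn.Module):
    """One message-passing round that transforms same-speaker and
    cross-speaker edges separately (one reading of 'speaker dependency')."""

    def __init__(self, dim: int):
        super().__init__()
        self.same = nn.Linear(dim, dim)
        self.cross = nn.Linear(dim, dim)

    def forward(self, h, adj, same_spk):
        # adj: (N, N) utterance adjacency; same_spk: (N, N) speaker-match mask.
        msg = (adj * same_spk) @ self.same(h) + (adj * (1 - same_spk)) @ self.cross(h)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)  # avoid divide-by-zero
        return F.relu(h + msg / deg)                        # residual update


# Toy conversation: 6 utterances, 128-d fused features, 2 speakers (all assumed).
N, D = 6, 128
x = torch.randn(N, D)
adj = (torch.rand(N, N) > 0.5).float()
spk = torch.randint(0, 2, (N,))
same_spk = (spk[:, None] == spk[None, :]).float()

decouple = EmotionDecoupler(D)
gnn_aware, gnn_agnostic = SpeakerAwarePassing(D), SpeakerAwarePassing(D)

h_aware, h_agnostic = decouple(x)
h_aware = gnn_aware(h_aware, adj, same_spk)           # propagate emotion-aware knowledge
h_agnostic = gnn_agnostic(h_agnostic, adj, same_spk)  # propagate emotion-agnostic knowledge

# Naive additive fusion standing in for the paper's multimodal distillation unit.
logits = nn.Linear(D, 7)(h_aware + h_agnostic)  # 7 emotion classes, assumed
print(logits.shape)  # torch.Size([6, 7])
```

Keeping the two streams in separate GNN modules reflects the abstract's motivation: emotion-aware messages are never mixed into the agnostic stream during propagation, so connected nodes carrying different emotions can retain distinguishable embeddings.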
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.