Masked Graph Learning With Recurrent Alignment for Multimodal Emotion Recognition in Conversation

IF 4.1 | Zone 2 (Computer Science) | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-26 DOI:10.1109/TASLP.2024.3434495

Tao Meng;Fuchen Zhang;Yuntao Shou;Hongen Shao;Wei Ai;Keqin Li

Volume 32, pages 4298-4312. Full text: https://ieeexplore.ieee.org/document/10612252/

Citations: 0

Abstract

Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields, and has therefore received extensive research attention in recent years. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information from multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work has ignored the inter-modal alignment process and the intra-modal noise before multimodal fusion and has instead fused multimodal features directly, which hinders the model's representation learning. In this study, we develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem. It uses a recurrent iterative module with memory to align multimodal features and then uses a masked GCN for multimodal feature fusion. First, we employ an LSTM to capture contextual information and use a graph attention-filtering mechanism to effectively eliminate noise within each modality. Second, we build a recurrent iteration module with a memory function, which uses communication between different modalities to eliminate the gap between them and achieve a preliminary alignment of their features. Then, a cross-modal multi-head attention mechanism is introduced to achieve feature alignment between modalities, and a masked GCN is constructed for multimodal feature fusion; it performs random mask reconstruction on the nodes of the graph to obtain better node feature representations. Finally, we utilize a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that MGLRA outperforms state-of-the-art methods.
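Two of the steps named in the abstract, cross-modal attention and random mask reconstruction on graph nodes, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head attention instead of multi-head, an unweighted GCN propagation step (identity weight matrix), a zero vector as the mask token, and a mean-squared reconstruction error. All function names and the toy data are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(query, context):
    """Single-head scaled dot-product attention: `query` rows attend to `context` rows."""
    d = len(query[0])
    context_t = [list(col) for col in zip(*context)]      # (d, n_context)
    scores = matmul(query, context_t)                      # (n_query, n_context)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context)                        # (n_query, d)

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a standard GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def masked_gcn_step(features, adj, mask_ratio=0.5, rng=None):
    """Zero out randomly chosen nodes, propagate once, score reconstruction on them."""
    rng = rng or random.Random(0)
    n, dim = len(features), len(features[0])
    k = max(1, int(round(mask_ratio * n)))
    masked = sorted(rng.sample(range(n), k))
    # Assumption: the mask token is a zero vector.
    corrupted = [[0.0] * dim if i in masked else list(f) for i, f in enumerate(features)]
    # Assumption: learned weight matrix omitted, so this is pure neighbor averaging.
    recon = matmul(normalized_adjacency(adj), corrupted)
    loss = sum((recon[i][j] - features[i][j]) ** 2
               for i in masked for j in range(dim)) / (k * dim)
    return masked, recon, loss

# Toy conversation: 3 utterances, 2-dimensional features per modality (made-up values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_attention(text, audio)    # text utterances attend to audio
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]         # chain graph over utterances
masked, recon, loss = masked_gcn_step(aligned, adj, mask_ratio=0.34)
```

In a trained model the propagation step would carry learned weights and the reconstruction loss would be minimized jointly with the classification loss; the sketch only shows the data flow of masking, propagating, and comparing against the original node features.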


Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Number of articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.