Text-centric disentangled representation interaction network for Multimodal Sentiment Analysis

IF 5.5 | CAS Region 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jingming Hou, Nazlia Omar, Sabrina Tiun, Saidah Saad, Qian He
{"title":"Text-centric disentangled representation interaction network for Multimodal Sentiment Analysis","authors":"Jingming Hou ,&nbsp;Nazlia Omar ,&nbsp;Sabrina Tiun ,&nbsp;Saidah Saad ,&nbsp;Qian He","doi":"10.1016/j.neucom.2025.130857","DOIUrl":null,"url":null,"abstract":"<div><div>With the rise of short video content, Multimodal Sentiment Analysis (MSA) has gained significant attention as a research hotspot. However, the issue of heterogeneity has emerged as a major challenge in fusing these three modalities. While some recent studies have attempted to reduce this problem of heterogeneity by disentangling the modalities, they overlooked two critical issues. First, their approach treats all three modalities equally during disentanglement, overlooking the central role of the text modality in MSA. As the primary carrier of semantic and emotional information, the text modality serves as the backbone for sentiment interpretation and multimodal fusion. Additionally, after disentangling the modalities, they do not effectively leverage the unique features of each modality, relying instead on simple concatenation and Transformer to combine similar and dissimilar features. To fully harness the potential of text modality and the dissimilar features between the modalities, we propose a Text-centric Disentangled Representation Interaction Network (TDRIN), consisting of two main modules. In the Disentangled Representation Learning (DRL) module, we decompose representations from different modalities into separate sub-spaces centered around text modality, aiming to capture similar and dissimilar features among the modalities. Meanwhile, we utilize various constraints to learn better features and improve predictions. Additionally, to more effectively balance the similar and dissimilar features, we design the Disentangled Representation Fusion Network (DRFN) module, which fuses disentangled representations with text modality as the center, fully exploiting the correlations among disentangled representations. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TDRIN outperforms state-of-the-art methods across various metrics. Specifically, the F1 score surpasses the best-performing baseline by 3.19%, 0.96%, and 1.43% on the three datasets, respectively. Ablation studies further confirm the effectiveness of each module. Therefore, TDRIN effectively reduces the heterogeneity between modalities, resulting in improved performance in MSA tasks.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"651 ","pages":"Article 130857"},"PeriodicalIF":5.5000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225015292","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

With the rise of short video content, Multimodal Sentiment Analysis (MSA) has gained significant attention as a research hotspot. However, heterogeneity among the text, audio, and visual modalities has emerged as a major challenge in fusing them. While some recent studies have attempted to reduce this heterogeneity by disentangling the modalities, they overlook two critical issues. First, they treat all three modalities equally during disentanglement, ignoring the central role of the text modality in MSA: as the primary carrier of semantic and emotional information, text serves as the backbone for sentiment interpretation and multimodal fusion. Second, after disentangling the modalities, they do not effectively leverage the unique features of each modality, relying instead on simple concatenation and a Transformer to combine similar and dissimilar features. To fully harness the potential of the text modality and the dissimilar features between modalities, we propose a Text-centric Disentangled Representation Interaction Network (TDRIN) consisting of two main modules. The Disentangled Representation Learning (DRL) module decomposes representations from the different modalities into separate sub-spaces centered on the text modality, capturing both the similar and the dissimilar features among the modalities, while several constraints encourage better features and improve predictions. To balance the similar and dissimilar features more effectively, the Disentangled Representation Fusion Network (DRFN) module then fuses the disentangled representations with the text modality as the center, fully exploiting the correlations among them. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TDRIN outperforms state-of-the-art methods across various metrics; the F1 score surpasses the best-performing baseline by 3.19%, 0.96%, and 1.43% on the three datasets, respectively. Ablation studies further confirm the effectiveness of each module. TDRIN thus effectively reduces the heterogeneity between modalities, resulting in improved performance on MSA tasks.
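The abstract describes TDRIN only at a high level. As an illustration of the text-centric disentanglement idea in the DRL module, the following is a minimal PyTorch sketch: all module names, feature dimensions (e.g. BERT-style 768-d text features), and constraint formulations are illustrative assumptions, not the authors' implementation.

```python
# Sketch of text-centric disentangled representation learning: audio and
# visual features are projected into a "similar-to-text" sub-space and a
# "dissimilar" (modality-specific) sub-space, with simple constraint losses.
# Dimensions and loss choices are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCentricDisentangler(nn.Module):
    def __init__(self, dim_text=768, dim_audio=74, dim_visual=35, dim_hidden=128):
        super().__init__()
        self.text_proj = nn.Linear(dim_text, dim_hidden)
        # separate encoders for the similar and dissimilar sub-spaces
        self.audio_sim = nn.Linear(dim_audio, dim_hidden)
        self.audio_dif = nn.Linear(dim_audio, dim_hidden)
        self.visual_sim = nn.Linear(dim_visual, dim_hidden)
        self.visual_dif = nn.Linear(dim_visual, dim_hidden)

    def forward(self, text, audio, visual):
        t = self.text_proj(text)
        a_sim, a_dif = self.audio_sim(audio), self.audio_dif(audio)
        v_sim, v_dif = self.visual_sim(visual), self.visual_dif(visual)

        # similarity constraint: pull the "similar" sub-spaces toward text
        sim_loss = F.mse_loss(a_sim, t) + F.mse_loss(v_sim, t)
        # difference constraint: push similar/dissimilar parts apart
        # (soft orthogonality via squared cosine similarity)
        dif_loss = (F.cosine_similarity(a_sim, a_dif, dim=-1) ** 2).mean() + \
                   (F.cosine_similarity(v_sim, v_dif, dim=-1) ** 2).mean()

        return (t, a_sim, a_dif, v_sim, v_dif), sim_loss + dif_loss


if __name__ == "__main__":
    model = TextCentricDisentangler()
    text, audio, visual = torch.randn(8, 768), torch.randn(8, 74), torch.randn(8, 35)
    reps, aux_loss = model(text, audio, visual)
    print([r.shape for r in reps], aux_loss.item())
```

In the paper, a fusion module (DRFN) would then combine these disentangled representations with the text features as the anchor; the sketch above only covers the decomposition and constraint step.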
Source journal

Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.