Jingming Hou, Nazlia Omar, Sabrina Tiun, Saidah Saad, Qian He
Journal: Neurocomputing, Volume 651, Article 130857
DOI: 10.1016/j.neucom.2025.130857
Published: 2025-07-18 (Journal Article)
Impact Factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0925231225015292
Text-centric disentangled representation interaction network for Multimodal Sentiment Analysis
With the rise of short video content, Multimodal Sentiment Analysis (MSA) has attracted significant research attention. However, heterogeneity among the three modalities (text, audio, and visual) has emerged as a major challenge in fusing them. While some recent studies have attempted to reduce this heterogeneity by disentangling the modalities, they overlook two critical issues. First, they treat all three modalities equally during disentanglement, ignoring the central role of the text modality in MSA. As the primary carrier of semantic and emotional information, the text modality serves as the backbone for sentiment interpretation and multimodal fusion. Second, after disentangling the modalities, they do not effectively leverage the unique features of each modality, relying instead on simple concatenation and Transformer layers to combine similar and dissimilar features. To fully harness the potential of the text modality and the dissimilar features between the modalities, we propose a Text-centric Disentangled Representation Interaction Network (TDRIN), consisting of two main modules. In the Disentangled Representation Learning (DRL) module, we decompose representations from the different modalities into separate sub-spaces centered around the text modality, aiming to capture similar and dissimilar features among the modalities. Meanwhile, we apply various constraints to learn better features and improve predictions. Additionally, to balance the similar and dissimilar features more effectively, we design the Disentangled Representation Fusion Network (DRFN) module, which fuses the disentangled representations with the text modality as the center, fully exploiting the correlations among them. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TDRIN outperforms state-of-the-art methods across various metrics. Specifically, its F1 score surpasses the best-performing baseline by 3.19%, 0.96%, and 1.43% on the three datasets, respectively. Ablation studies further confirm the effectiveness of each module. Therefore, TDRIN effectively reduces the heterogeneity between modalities, resulting in improved performance on MSA tasks.
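The abstract does not give the exact formulation of the DRL and DRFN modules, but the general idea it describes (projecting each modality into a shared "similar" subspace and a private "dissimilar" subspace, anchored on the text modality, with constraints that align shared features and disentangle private ones) can be sketched in plain NumPy. This is a minimal illustrative toy, not the authors' implementation: the dimensions, the linear projections, and the specific cosine-similarity and orthogonality penalties are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sub = 16, 8  # hypothetical input and subspace dimensions

# Toy utterance-level features for the three modalities.
feats = {m: rng.normal(size=d_in) for m in ("text", "audio", "visual")}

# Per-modality "shared" encoder (similar features) and "private"
# encoder (dissimilar features); here plain random linear maps.
W_shared = {m: rng.normal(size=(d_sub, d_in)) for m in feats}
W_private = {m: rng.normal(size=(d_sub, d_in)) for m in feats}

shared = {m: W_shared[m] @ x for m, x in feats.items()}
private = {m: W_private[m] @ x for m, x in feats.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Text-centric similarity constraint: a training loss of this form
# would pull audio/visual shared features toward the text anchor.
sim_loss = sum(1.0 - cosine(shared["text"], shared[m])
               for m in ("audio", "visual"))

# Disentanglement constraint: a loss of this form would push each
# modality's shared and private parts toward orthogonality.
ortho_loss = sum(abs(float(shared[m] @ private[m])) for m in feats)

# Text-centric fusion stub: concatenate shared parts (text first)
# with the modality-specific private parts.
fused = np.concatenate([shared["text"], shared["audio"], shared["visual"],
                        private["text"], private["audio"], private["visual"]])
print(fused.shape)  # (48,)
```

In an actual model the linear maps would be learned jointly with a sentiment regression head, and the two penalties would enter the training objective as weighted terms rather than being computed once as above.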
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.