Jingming Hou, Nazlia Omar, Sabrina Tiun, Saidah Saad, Qian He
Journal: Neurocomputing, Volume 651, Article 130857
DOI: 10.1016/j.neucom.2025.130857
Published: 2025-07-18 (Journal Article)
Impact Factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0925231225015292
Text-centric disentangled representation interaction network for Multimodal Sentiment Analysis
With the rise of short video content, Multimodal Sentiment Analysis (MSA) has attracted significant research attention. However, heterogeneity among the three modalities (text, audio, and visual) has emerged as a major challenge in fusing them. While some recent studies have attempted to reduce this heterogeneity by disentangling the modalities, they overlook two critical issues. First, they treat all three modalities equally during disentanglement, ignoring the central role of the text modality in MSA. As the primary carrier of semantic and emotional information, the text modality serves as the backbone for sentiment interpretation and multimodal fusion. Second, after disentangling the modalities, they do not effectively leverage the unique features of each modality, relying instead on simple concatenation and Transformer layers to combine similar and dissimilar features. To fully harness the potential of the text modality and the dissimilar features between the modalities, we propose a Text-centric Disentangled Representation Interaction Network (TDRIN), consisting of two main modules. In the Disentangled Representation Learning (DRL) module, we decompose representations from the different modalities into separate sub-spaces centered around the text modality, aiming to capture similar and dissimilar features among the modalities. Meanwhile, we apply various constraints to learn better features and improve predictions. Additionally, to balance the similar and dissimilar features more effectively, we design the Disentangled Representation Fusion Network (DRFN) module, which fuses the disentangled representations with the text modality as the center, fully exploiting the correlations among them. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TDRIN outperforms state-of-the-art methods across various metrics. Specifically, its F1 score surpasses the best-performing baseline by 3.19%, 0.96%, and 1.43% on the three datasets, respectively. Ablation studies further confirm the effectiveness of each module. Therefore, TDRIN effectively reduces the heterogeneity between modalities, resulting in improved performance on MSA tasks.
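The abstract does not give the exact formulation of the DRL and DRFN modules, but the general idea it describes (projecting each modality into a shared "similar" subspace and a private "dissimilar" subspace, anchored on the text modality, with constraints that align shared features and disentangle private ones) can be sketched in plain NumPy. This is a minimal illustrative toy, not the authors' implementation: the dimensions, the linear projections, and the specific cosine-similarity and orthogonality penalties are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sub = 16, 8  # hypothetical input and subspace dimensions

# Toy utterance-level features for the three modalities.
feats = {m: rng.normal(size=d_in) for m in ("text", "audio", "visual")}

# Per-modality "shared" encoder (similar features) and "private"
# encoder (dissimilar features); here plain random linear maps.
W_shared = {m: rng.normal(size=(d_sub, d_in)) for m in feats}
W_private = {m: rng.normal(size=(d_sub, d_in)) for m in feats}

shared = {m: W_shared[m] @ x for m, x in feats.items()}
private = {m: W_private[m] @ x for m, x in feats.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Text-centric similarity constraint: a training loss of this form
# would pull audio/visual shared features toward the text anchor.
sim_loss = sum(1.0 - cosine(shared["text"], shared[m])
               for m in ("audio", "visual"))

# Disentanglement constraint: a loss of this form would push each
# modality's shared and private parts toward orthogonality.
ortho_loss = sum(abs(float(shared[m] @ private[m])) for m in feats)

# Text-centric fusion stub: concatenate shared parts (text first)
# with the modality-specific private parts.
fused = np.concatenate([shared["text"], shared["audio"], shared["visual"],
                        private["text"], private["audio"], private["visual"]])
print(fused.shape)  # (48,)
```

In an actual model the linear maps would be learned jointly with a sentiment regression head, and the two penalties would enter the training objective as weighted terms rather than being computed once as above.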
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.