Multimodal sentiment analysis based on disentangled representation learning and cross-modal-context association mining
Zuhe Li, Panbo Liu, Yushan Pan, Weiping Ding, Jun Yu, Haoran Chen, Weihua Liu, Yiming Luo, Hao Wang
Neurocomputing, vol. 617, Article 128940. Published 2024-11-24. DOI: 10.1016/j.neucom.2024.128940
https://www.sciencedirect.com/science/article/pii/S0925231224017119
Abstract
Multimodal sentiment analysis aims to extract sentiment information expressed by users from multimodal data, including linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data leads to disparities in modal distribution, thereby impacting the model’s ability to effectively integrate complementarity and redundancy across modalities. Additionally, existing approaches often merge modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. This framework consists primarily of two modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer comprising a common encoder and an exclusive encoder. This layer maps multimodal data to a hypersphere, learning common and exclusive representations for each modality, thus mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module utilizes multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. In this stage, the CMCAM module only discovers the correlation information among the common representations in order to maintain the exclusive representations of different modalities. Finally, a multitask learning framework is adopted to achieve parameter sharing between single-modal tasks and improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
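The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear encoders, the dimensions, the single dot-product attention step, and the final concatenation are all assumptions chosen for brevity, and every name (`disentangle`, `cross_modal_attention`, `IN_DIM`, `REP_DIM`) is hypothetical. It shows a shared (common) encoder and per-modality (exclusive) encoders whose outputs are normalized onto the unit hypersphere, followed by attention that mixes only the common representations and leaves the exclusive ones untouched.

```python
import math
import random


def linear(W, x):
    """Apply a weight matrix W (a list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length, i.e. place it on the unit hypersphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Random weights standing in for trained encoder parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def disentangle(features, common_W, exclusive_Ws):
    """Assumed MDRL-style step: one shared encoder, one encoder per modality."""
    common = {m: l2_normalize(linear(common_W, x)) for m, x in features.items()}
    exclusive = {
        m: l2_normalize(linear(exclusive_Ws[m], x)) for m, x in features.items()
    }
    return common, exclusive


def cross_modal_attention(common):
    """Assumed CMCAM-style step: each modality attends over all common vectors."""
    names = list(common)
    dim = len(common[names[0]])
    attended = {}
    for query in names:
        scores = [
            sum(a * b for a, b in zip(common[query], common[key])) / math.sqrt(dim)
            for key in names
        ]
        weights = softmax(scores)
        attended[query] = [
            sum(w * common[key][i] for w, key in zip(weights, names))
            for i in range(dim)
        ]
    return attended


rng = random.Random(0)
IN_DIM, REP_DIM = 6, 4  # hypothetical sizes; inputs assumed already same-width

# Toy utterance-level features for the three modalities named in the abstract.
features = {
    m: [rng.uniform(-1.0, 1.0) for _ in range(IN_DIM)]
    for m in ("text", "audio", "visual")
}
common_W = random_matrix(rng, REP_DIM, IN_DIM)
exclusive_Ws = {m: random_matrix(rng, REP_DIM, IN_DIM) for m in features}

common, exclusive = disentangle(features, common_W, exclusive_Ws)
attended = cross_modal_attention(common)

# Joint representation: attended common parts plus unmodified exclusive parts.
joint = [v for m in features for v in attended[m] + exclusive[m]]
```

Because attention is applied only to `common`, the `exclusive` vectors reach the joint representation unchanged, which mirrors the abstract's statement that CMCAM mines correlations among the common representations only. A trained model would learn the encoder weights and would typically use multi-head attention over sequences rather than single vectors.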
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.