Zhihao Yang , Qing He , Minghao Yu , Nisuo Du , Yijie Lu
Information Fusion, Volume 126, Article 103571. Published 2025-08-05. DOI: 10.1016/j.inffus.2025.103571. JCR Q1 (Computer Science, Artificial Intelligence); impact factor 15.5.
https://www.sciencedirect.com/science/article/pii/S1566253525006438
TCTR: Text-Guided Contrastive Learning with Token-Level Reconstruction Network for missing modalities in multimodal sentiment analysis
Multimodal sentiment analysis (MSA) on incomplete multimodal data must cope with modality information that is randomly missing or corrupted by noise, with the aim of remaining robust under those conditions. This reflects the shift of MSA from idealized laboratory settings to real-world conditions and has made the problem a current research hotspot in multimodal learning. However, existing studies remain limited in how they analyze missingness and lack effective modeling of missing scenarios. Moreover, current methods focus primarily on completing missing modality features in the feature space and overlook information supplementation in the semantic space, which is crucial for MSA. To address this, we propose a text-guided, fine-grained network: Text-Guided Contrastive Learning with Token-Level Reconstruction Network (TCTR). The design is motivated by the observation that the text modality typically carries more direct and complete sentiment information. In TCTR, we first design the Token-level Missing Inspection (TMI) module, which performs token-level missing modeling on the guided modality; this fine-grained analysis addresses the insufficient capture of critical sentiment information in missing inspection. Subsequently, in the Semantic Contrastive Learning for Missing Modality Supplementation (SCL-MMS) module, we leverage constructed negative sample labels to complete missing sentiment information jointly in the feature space and the semantic space, mitigating the inadequate supplementation quality that results from relying solely on the feature space, as existing methods do. Finally, building on prior research, we perform interaction and fusion of the multimodal features to predict sentiment polarity. Comparisons with state-of-the-art methods and ablation studies on various datasets show that TCTR achieves superior sentiment polarity prediction across different modality-missing scenarios, effectively enhancing the robustness of MSA under such conditions.
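The abstract does not spell out how token-level missingness is simulated or which contrastive objective SCL-MMS uses, so the following is only a generic sketch of the two ideas it names, not the authors' implementation. It assumes that missingness is simulated by zeroing whole token vectors at random, and it substitutes the standard InfoNCE loss for the paper's unspecified objective. The names `mask_tokens`, `cosine`, and `info_nce` are hypothetical.

```python
import math
import random


def mask_tokens(sequence, missing_rate, seed=0):
    """Simulate token-level missingness (assumed scheme, not from the paper).

    Each token vector is dropped (zeroed) independently with probability
    `missing_rate`. Returns (masked_sequence, mask), where mask[i] is 1 for
    a kept token and 0 for a dropped one.
    """
    rng = random.Random(seed)
    mask = [0 if rng.random() < missing_rate else 1 for _ in sequence]
    masked = [[x * m for x in token] for token, m in zip(sequence, mask)]
    return masked, mask


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss (a stand-in, not the SCL-MMS objective).

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) )
    computed with the log-sum-exp trick for numerical stability.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    # Toy "audio" sequence of three 2-dimensional token vectors.
    audio = [[0.5, 1.0], [1.5, -0.5], [0.2, 0.3]]
    masked, mask = mask_tokens(audio, missing_rate=0.5, seed=42)
    print("mask:", mask)

    # Text feature as anchor; a reconstructed feature as the positive and a
    # constructed negative sample pointing the other way.
    text = [1.0, 0.0]
    print("loss:", info_nce(text, [0.9, 0.1], [[-1.0, 0.2]]))
```

In this sketch the loss is small when the reconstructed feature agrees with the text anchor and large when a negative sample is closer to the anchor than the positive is, which is the general behaviour a text-guided contrastive term would rely on.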
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.