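TCTR: Text-Guided Contrastive Learning with Token-Level Reconstruction Network for missing modalities in multimodal sentiment analysis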

IF 15.5 · Zone 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion Pub Date : 2025-08-05 DOI:10.1016/j.inffus.2025.103571

Zhihao Yang , Qing He , Minghao Yu , Nisuo Du , Yijie Lu

Information Fusion, Volume 126, Article 103571. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S1566253525006438

Citations: 0

Abstract


TCTR: Text-Guided Contrastive Learning with Token-Level Reconstruction Network for missing modalities in multimodal sentiment analysis

Multimodal sentiment analysis (MSA) on incomplete multimodal data must account for randomly missing or noisy modality information while still producing robust sentiment predictions. This reflects the shift of MSA from idealized laboratory settings to real-world conditions and has made it a current research hotspot in multimodal learning. However, existing studies remain limited in how they analyze missingness and lack effective modeling of missing scenarios. Moreover, current methods primarily focus on completing missing modality features in the feature space, overlooking information supplementation in the semantic space, which is crucial for multimodal sentiment analysis tasks. To address this, we propose a text-guided fine-grained network model: Text-Guided Contrastive Learning with Token-Level Reconstruction Network (TCTR). The design is motivated by the fact that the text modality typically contains more direct and complete sentiment information. In TCTR, we first design the Token-level Missing Inspection (TMI) module to perform token-level missing modeling on the guided modality; this fine-grained analysis addresses the insufficient capture of critical sentiment information in missing inspection. Subsequently, in the Semantic Contrastive Learning for Missing Modality Supplementation (SCL-MMS) module, we leverage constructed negative sample labels to jointly complete missing sentiment information from both the feature space and the semantic space, mitigating the inadequate supplementation quality caused by relying solely on the feature space in existing methods. Finally, building on prior research, we perform interaction and fusion of multimodal features to enable sentiment polarity prediction. Performance comparisons with state-of-the-art methods and ablation studies on various datasets show that TCTR achieves superior sentiment polarity prediction across different modality-missing scenarios, effectively enhancing the robustness of MSA under such conditions.
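The abstract does not specify how TCTR masks tokens or which contrastive objective it uses, so the following is only a generic illustration of the two ideas it names: simulating token-level missing data, and a text-anchored contrastive loss with negative samples. The standard InfoNCE loss is swapped in as a stand-in objective, and every name here (`mask_tokens`, `info_nce`, `text_anchor`, the toy vectors) is hypothetical rather than taken from the paper.

```python
import math
import random


def mask_tokens(tokens, missing_rate, rng):
    """Zero out whole token vectors at random to simulate token-level missing data.

    Returns the masked sequence and a 0/1 list marking which tokens were dropped.
    """
    masked, dropped = [], []
    for vec in tokens:
        drop = rng.random() < missing_rate
        dropped.append(1 if drop else 0)
        masked.append([0.0] * len(vec) if drop else list(vec))
    return masked, dropped


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: -log(exp(s_pos/t) / (exp(s_pos/t) + sum_k exp(s_neg_k/t)))."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Toy usage (assumed data): three 2-d "audio" tokens, half dropped on average.
rng = random.Random(0)
audio_tokens = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
masked, dropped = mask_tokens(audio_tokens, missing_rate=0.5, rng=rng)

# A text embedding acts as the anchor; a reconstruction that agrees with it
# should score a lower loss than one that resembles a negative sample.
text_anchor = [1.0, 0.0]
loss_aligned = info_nce(text_anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(text_anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(dropped, loss_aligned, loss_misaligned)
```

In this sketch the text embedding plays the role of the guiding modality: the loss is small when the reconstructed feature points the same way as the text anchor and large when it resembles a negative sample instead.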


Source journal

Information Fusion (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

33.20

Self-citation rate

4.30%

Articles published

161

Review time

7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.