Jingwen Chen, Shuxiang Song, Yumei Tan, Haiying Xia
{"title":"TEMSA:用于多模态情感分析的文本增强模态表示学习","authors":"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia","doi":"10.1016/j.cviu.2025.104391","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104391"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TEMSA:Text enhanced modal representation learning for multimodal sentiment analysis\",\"authors\":\"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia\",\"doi\":\"10.1016/j.cviu.2025.104391\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. 
Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"258 \",\"pages\":\"Article 104391\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001146\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001146","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
TEMSA: Text-enhanced modal representation learning for multimodal sentiment analysis
Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.
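As a rough illustration of the text-guided cross-attention mechanism described in the abstract (not the authors' released implementation), the PyTorch sketch below shows one way text features could enhance a non-linguistic modality: the visual or audio sequence attends to the linguistic features, embedding text information into its representation. The class name, feature dimensions, and residual design are illustrative assumptions.

```python
# Minimal, hypothetical sketch of text-guided multi-head cross-attention
# (illustrative only; not TEMSA's actual code). The non-linguistic modality
# (visual or audio) queries the text features, so linguistic information is
# embedded into its emotion-related representation.
import torch
import torch.nn as nn


class TextGuidedCrossAttention(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nonling_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # nonling_feat: (batch, seq_len, dim)  visual or audio features (queries)
        # text_feat:    (batch, text_len, dim) linguistic features (keys/values)
        enhanced, _ = self.cross_attn(query=nonling_feat, key=text_feat, value=text_feat)
        # Residual connection preserves the original non-linguistic signal.
        return self.norm(nonling_feat + enhanced)


if __name__ == "__main__":
    attn = TextGuidedCrossAttention()
    audio = torch.randn(2, 50, 256)  # toy audio sequence
    text = torch.randn(2, 20, 256)   # toy text sequence
    print(attn(audio, text).shape)   # torch.Size([2, 50, 256])
```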
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems