Jingwen Chen, Shuxiang Song, Yumei Tan, Haiying Xia
{"title":"TEMSA:用于多模态情感分析的文本增强模态表示学习","authors":"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia","doi":"10.1016/j.cviu.2025.104391","DOIUrl":null,"url":null,"abstract":"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"258 ","pages":"Article 104391"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TEMSA:Text enhanced modal representation learning for multimodal sentiment analysis\",\"authors\":\"Jingwen Chen , Shuxiang Song , Yumei Tan , Haiying Xia\",\"doi\":\"10.1016/j.cviu.2025.104391\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. 
Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"258 \",\"pages\":\"Article 104391\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001146\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001146","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
TEMSA: Text-enhanced modal representation learning for multimodal sentiment analysis
Multimodal sentiment analysis aims to identify human emotions by leveraging multimodal information, including language, visual, and audio data. Most existing models focus on extracting common features across modalities or simply integrating heterogeneous multimodal data. However, such approaches often overlook the unique representation advantages of individual modalities, as they treat all modalities equally and use bidirectional information transfer mechanisms. This can lead to information redundancy and feature conflicts. To address this challenge, we propose a Text-Enhanced Modal Representation Learning Model (TEMSA), which builds robust and unified multimodal representations through the design of text-guided pairwise cross-modal mapping modules. Specifically, TEMSA employs a text-guided multi-head cross-attention mechanism to embed linguistic information into the emotion-related representation learning of non-linguistic modalities, thereby enhancing the representations of visual and audio modalities. In addition to preserving consistent information through cross-modal mapping, TEMSA also incorporates text-guided reconstruction modules, which leverage text-enhanced non-linguistic modal features to decouple modality-specific representations from non-linguistic modalities. This dual representation learning framework captures inter-modal consistent information through cross-modal mapping, and extracts modal difference information through intra-modal decoupling, thus improving the understanding of cross-modal affective associations. The experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets demonstrate that TEMSA achieves superior performance, highlighting the critical role of text-guided cross-modal and intra-modal representation learning in multimodal sentiment analysis.
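As a rough illustration of the text-guided cross-attention mechanism described in the abstract (not the authors' released implementation), the PyTorch sketch below shows one way text features could enhance a non-linguistic modality: the visual or audio sequence attends to the linguistic features, embedding text information into its representation. The class name, feature dimensions, and residual design are illustrative assumptions.

```python
# Minimal, hypothetical sketch of text-guided multi-head cross-attention
# (illustrative only; not TEMSA's actual code). The non-linguistic modality
# (visual or audio) queries the text features, so linguistic information is
# embedded into its emotion-related representation.
import torch
import torch.nn as nn


class TextGuidedCrossAttention(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nonling_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # nonling_feat: (batch, seq_len, dim)  visual or audio features (queries)
        # text_feat:    (batch, text_len, dim) linguistic features (keys/values)
        enhanced, _ = self.cross_attn(query=nonling_feat, key=text_feat, value=text_feat)
        # Residual connection preserves the original non-linguistic signal.
        return self.norm(nonling_feat + enhanced)


if __name__ == "__main__":
    attn = TextGuidedCrossAttention()
    audio = torch.randn(2, 50, 256)  # toy audio sequence
    text = torch.randn(2, 20, 256)   # toy text sequence
    print(attn(audio, text).shape)   # torch.Size([2, 50, 256])
```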
Journal introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems