用于多模态情感分析的全球信息监管网络

IF 4.2 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Shufan Xie, Qiaohong Chen, Xian Fang, Qi Sun
{"title":"用于多模态情感分析的全球信息监管网络","authors":"Shufan Xie,&nbsp;Qiaohong Chen,&nbsp;Xian Fang,&nbsp;Qi Sun","doi":"10.1016/j.imavis.2024.105297","DOIUrl":null,"url":null,"abstract":"<div><div>Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105297"},"PeriodicalIF":4.2000,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Global information regulation network for multimodal sentiment analysis\",\"authors\":\"Shufan Xie,&nbsp;Qiaohong Chen,&nbsp;Xian Fang,&nbsp;Qi Sun\",\"doi\":\"10.1016/j.imavis.2024.105297\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"151 \",\"pages\":\"Article 105297\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885624004025\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004025","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

人类语言被认为是多模态的,包含自然语言、视觉元素和声音信号。多模态情感分析(MSA)侧重于整合各种模态,以捕捉人类语言中表达的情感极性或强度。然而,由于缺乏处理和整合多模态表征的综合策略,在最终决策过程中会包含来自不同模态的不准确或有噪声的数据,从而可能导致忽略模态内或模态间的关键信息。为了解决这个问题,我们提出了全球信息调节网络(GIRN),这是一个新颖的框架,旨在调节从单模态特征提取到多模态结果预测等各个阶段的信息流和决策过程。具体来说,在模态融合阶段之前,我们会最大化模态之间的互信息,并通过随机特征擦除来完善输入信号,从而获得更稳健的单模态表示。在模态融合过程中,我们通过门机制和叠加注意力来增强传统的 Transformer 编码器,从而实现目标模态和辅助模态的动态融合。模态融合后,我们采用跨层级对比学习和决策门来整合不同类别和层级所代表的有价值信息。在 CMU-MOSI 和 CMU-MOSEI 数据集上进行的大量实验表明,我们的方法在几乎所有标准上都优于现有方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Global information regulation network for multimodal sentiment analysis
Human language is considered multimodal, containing natural language, visual elements, and acoustic signals. Multimodal Sentiment Analysis (MSA) concentrates on the integration of various modalities to capture the sentiment polarity or intensity expressed in human language. Nevertheless, the absence of a comprehensive strategy for processing and integrating multimodal representations results in the inclusion of inaccurate or noisy data from diverse modalities in the ultimate decision-making process, potentially leading to the neglect of crucial information within or across modalities. To address this issue, we propose the Global Information Regulation Network (GIRN), a novel framework designed to regulate information flow and decision-making processes across various stages, ranging from unimodal feature extraction to multimodal outcome prediction. Specifically, before modal fusion stage, we maximize the mutual information between modalities and refine the input signals through random feature erasing, yielding a more robust unimodal representation. In the process of modal fusion, we enhance the traditional Transformer encoder through the gate mechanism and stacked attention to dynamically fuse the target and auxiliary modalities. After modal fusion, cross-hierarchical contrastive learning and decision gate are employed to integrate the valuable information represented in different categories and hierarchies. Extensive experiments conducted on the CMU-MOSI and CMU-MOSEI datasets suggest that our methodology outperforms existing approaches across nearly all criteria.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Image and Vision Computing
Image and Vision Computing 工程技术-工程:电子与电气
CiteScore
8.50
自引率
8.50%
发文量
143
审稿时长
7.8 months
期刊介绍: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信