Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis

Zikai Zhou, Baiyou Qiao, Haisong Feng, Donghong Han, Gang Wu
{"title":"Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis","authors":"Zikai Zhou, Baiyou Qiao, Haisong Feng, Donghong Han, Gang Wu","doi":"10.1007/s00521-024-10297-w","DOIUrl":null,"url":null,"abstract":"<p>To fulfill the explosion of multi-modal data, multi-modal sentiment analysis (MSA) emerged and attracted widespread attention. Unfortunately, conventional multi-modal research relies on large-scale datasets. On the one hand, collecting and annotating large-scale datasets is challenging and resource-intensive. On the other hand, the training on large-scale datasets also increases the research cost. However, the few-shot MSA (FMSA), which is proposed recently, requires only few samples for training. Therefore, in comparison, it is more practical and realistic. There have been approaches to investigating the prompt-based method in the field of FMSA, but they have not sufficiently considered or leveraged the information specificity of visual modality. Thus, we propose a vision-enhanced prompt-based model based on graph structure to better utilize vision information for fusion and collaboration in encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, based on this module and the biaffine attention, we construct a syntax–semantic dual-channel graph convolutional network to optimize the encoding of learnable prompts by understanding the vision-enhanced information in semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module based on the collaborative attention mechanism, which employs visual information to collaboratively optimize prompt representations. Extensive experiments conducted on both coarse-grained and fine-grained MSA datasets have demonstrated that our model significantly outperforms the baseline models.</p>","PeriodicalId":18925,"journal":{"name":"Neural Computing and Applications","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Computing and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00521-024-10297-w","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

To fulfill the explosion of multi-modal data, multi-modal sentiment analysis (MSA) emerged and attracted widespread attention. Unfortunately, conventional multi-modal research relies on large-scale datasets. On the one hand, collecting and annotating large-scale datasets is challenging and resource-intensive. On the other hand, the training on large-scale datasets also increases the research cost. However, the few-shot MSA (FMSA), which is proposed recently, requires only few samples for training. Therefore, in comparison, it is more practical and realistic. There have been approaches to investigating the prompt-based method in the field of FMSA, but they have not sufficiently considered or leveraged the information specificity of visual modality. Thus, we propose a vision-enhanced prompt-based model based on graph structure to better utilize vision information for fusion and collaboration in encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, based on this module and the biaffine attention, we construct a syntax–semantic dual-channel graph convolutional network to optimize the encoding of learnable prompts by understanding the vision-enhanced information in semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module based on the collaborative attention mechanism, which employs visual information to collaboratively optimize prompt representations. Extensive experiments conducted on both coarse-grained and fine-grained MSA datasets have demonstrated that our model significantly outperforms the baseline models.

Abstract Image

针对少镜头多模态情感分析的注意力优化视觉增强提示学习
为了应对多模态数据的爆炸式增长,多模态情感分析(MSA)应运而生,并引起了广泛关注。遗憾的是,传统的多模态研究依赖于大规模数据集。一方面,收集和注释大规模数据集是一项具有挑战性的资源密集型工作。另一方面,在大规模数据集上进行训练也增加了研究成本。然而,最近提出的少量样本 MSA(FMSA)只需要少量样本进行训练。因此,相比之下,它更实用、更现实。在 FMSA 领域,已经有研究基于提示的方法的方法,但这些方法没有充分考虑或利用视觉模式的信息特异性。因此,我们提出了一种基于图结构的视觉增强型提示模型,以便在编码和优化提示表征时更好地利用视觉信息进行融合与协作。具体来说,我们首先设计了一个基于聚合的多模态注意力模块。然后,基于该模块和双模注意力,我们构建了一个语法-语义双通道图卷积网络,通过理解视觉增强的语义和句法知识信息来优化可学习提示的编码。最后,我们提出了基于协作注意机制的协作优化模块,该模块利用视觉信息协作优化提示表征。在粗粒度和细粒度 MSA 数据集上进行的大量实验表明,我们的模型明显优于基线模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信