Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis

Neural Computing and Applications. Pub Date: 2024-08-22. DOI: 10.1007/s00521-024-10297-w

Zikai Zhou, Baiyou Qiao, Haisong Feng, Donghong Han, Gang Wu



Abstract

Multi-modal sentiment analysis (MSA) emerged in response to the explosive growth of multi-modal data and has attracted widespread attention. Conventional multi-modal research, however, relies on large-scale datasets: collecting and annotating such datasets is challenging and resource-intensive, and training on them further raises research costs. The recently proposed few-shot MSA (FMSA) requires only a few samples for training, which makes it more practical and realistic by comparison. Prompt-based methods have already been investigated for FMSA, but they have not sufficiently considered or leveraged the information specific to the visual modality. We therefore propose a vision-enhanced, prompt-based model built on graph structure that better utilizes visual information for fusion and collaboration when encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, based on this module and biaffine attention, we construct a syntax–semantic dual-channel graph convolutional network that optimizes the encoding of learnable prompts by capturing vision-enhanced information in semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module, built on a collaborative attention mechanism, that uses visual information to collaboratively optimize prompt representations. Extensive experiments on both coarse-grained and fine-grained MSA datasets demonstrate that our model significantly outperforms the baseline models.
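For readers unfamiliar with two of the building blocks named in the abstract, the following is a minimal, generic sketch of how biaffine attention can produce a soft adjacency matrix that a graph convolution layer then consumes. It is not the authors' implementation: the abstract does not specify the architecture in detail, so the scoring form (a bilinear term plus a linear term), the single ReLU(A H W) layer, all function names, the dimensions, and the random features standing in for fused text-vision token representations are assumptions made purely for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Normalize each row into a probability distribution."""
    out = []
    for row in scores:
        peak = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def biaffine_adjacency(h, u, w):
    """Soft adjacency: A[i][j] = softmax over j of (h_i^T U h_j + w^T h_j)."""
    n = len(h)
    h_t = [list(col) for col in zip(*h)]
    bilinear = matmul(matmul(h, u), h_t)  # n x n pairwise scores
    linear = [sum(x * y for x, y in zip(row, w)) for row in h]  # one bias per token j
    scores = [[bilinear[i][j] + linear[j] for j in range(n)] for i in range(n)]
    return softmax_rows(scores)


def gcn_layer(adj, h, weight):
    """One graph convolution step: ReLU(A H W)."""
    mixed = matmul(matmul(adj, h), weight)
    return [[max(0.0, v) for v in row] for row in mixed]


def random_matrix(rows, cols, rng):
    """Small random matrix used here in place of learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_in, d_out = 4, 3, 2            # hypothetical sizes
h = random_matrix(n_tokens, d_in, rng)     # stand-in for fused token features
u = random_matrix(d_in, d_in, rng)         # bilinear weight (assumed shape)
w = [rng.uniform(-0.5, 0.5) for _ in range(d_in)]
weight = random_matrix(d_in, d_out, rng)   # GCN projection weight

adj = biaffine_adjacency(h, u, w)          # each row sums to 1
out = gcn_layer(adj, h, weight)            # n_tokens x d_out node features
print(len(out), len(out[0]))
```

In a dual-channel design, one such layer would typically run per channel, each with its own adjacency matrix, before the two outputs are combined; how the paper combines them is not stated in the abstract.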


