A unified prompt-based framework for few-shot multimodal language analysis

Xiaohan Zhang, Runmin Cao, Yifan Wang, Songze Li, Hua Xu, Kai Gao, Lunsong Huang
Intelligent Systems with Applications, Volume 26, March 2025, Article 200498
DOI: 10.1016/j.iswa.2025.200498
https://www.sciencedirect.com/science/article/pii/S2667305325000249

Abstract

Multimodal language analysis is a trending topic in NLP. It relies on large-scale annotated data, which is scarce because annotation is time-consuming and labor-intensive. Multimodal prompt learning has shown promise in low-resource scenarios. However, previous works either cannot handle semantically complex tasks or involve too few modalities. In addition, most of them focus only on prompting the language modality, disregarding the untapped potential of other modalities. We propose a unified prompt-based framework for few-shot multimodal language analysis. Specifically, built on a pretrained language model, our model can handle semantically complex tasks involving the text, audio, and video modalities. To enable the language model to use the video and audio modalities more effectively, we introduce semantic alignment pre-training to bridge the semantic gap between them and the language model, alongside an effective fusion method for the video and audio modalities. Additionally, we introduce a novel and effective prompt method — the Multimodal Prompt Encoder — to prompt the entirety of the multimodal information. Extensive experiments conducted on six datasets across four multimodal language subtasks demonstrate the effectiveness of our approach.
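To make the general idea concrete, the sketch below illustrates one common form of multimodal prompt tuning: learnable soft-prompt vectors and audio/video features projected into the language model's embedding space are prepended to the text-token embeddings before the sequence enters the model. This is a minimal illustration of the paradigm only, not the authors' implementation; all dimensions, variable names, and the concatenation order are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the paper's code): prompt a language model with
# multimodal inputs by prepending (a) learnable soft-prompt vectors and
# (b) audio/video features linearly projected into the LM embedding space.
# All sizes below are hypothetical.

rng = np.random.default_rng(0)
D = 16                    # hypothetical LM embedding size
N_PROMPT = 4              # number of learnable soft-prompt vectors
D_AUDIO, D_VIDEO = 8, 12  # hypothetical raw feature sizes

# "Learnable" pieces, randomly initialized for the sketch.
soft_prompt = rng.normal(size=(N_PROMPT, D))
W_audio = rng.normal(size=(D_AUDIO, D))  # audio -> LM-space projection
W_video = rng.normal(size=(D_VIDEO, D))  # video -> LM-space projection

def build_input(text_emb, audio_feat, video_feat):
    """Concatenate [soft prompt; projected audio; projected video; text]."""
    audio_tok = audio_feat @ W_audio   # (T_audio, D)
    video_tok = video_feat @ W_video   # (T_video, D)
    return np.concatenate([soft_prompt, audio_tok, video_tok, text_emb], axis=0)

text_emb = rng.normal(size=(5, D))         # 5 text-token embeddings
audio_feat = rng.normal(size=(3, D_AUDIO)) # 3 audio frames
video_feat = rng.normal(size=(2, D_VIDEO)) # 2 video frames

seq = build_input(text_emb, audio_feat, video_feat)
print(seq.shape)  # (14, 16): 4 prompt + 3 audio + 2 video + 5 text vectors
```

In practice the soft prompt and projection matrices would be trained on the few-shot data while the language model stays frozen, which is what makes the approach attractive in low-resource settings.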