A unified prompt-based framework for few-shot multimodal language analysis

Xiaohan Zhang, Runmin Cao, Yifan Wang, Songze Li, Hua Xu, Kai Gao, Lunsong Huang

Intelligent Systems with Applications, Volume 26, Article 200498 (March 2025). DOI: 10.1016/j.iswa.2025.200498. Available at https://www.sciencedirect.com/science/article/pii/S2667305325000249
Multimodal language analysis is a trending topic in NLP. It relies on large-scale annotated data, which is scarce because annotation is time-consuming and labor-intensive. Multimodal prompt learning has shown promise in low-resource scenarios. However, previous works either cannot handle semantically complex tasks or involve too few modalities. In addition, most of them focus only on prompting the language modality, disregarding the untapped potential of other modalities. We propose a unified prompt-based framework for few-shot multimodal language analysis. Specifically, built on a pretrained language model, our model can handle semantically complex tasks involving the text, audio, and video modalities. To let the language model make more effective use of the video and audio modalities, we introduce semantic alignment pre-training to bridge the semantic gap between them and the language model, alongside an effective fusion method for the video and audio modalities. Additionally, we introduce a novel and effective prompting method, the Multimodal Prompt Encoder, to prompt the entirety of the multimodal information. Extensive experiments conducted on six datasets across four multimodal language subtasks demonstrate the effectiveness of our approach.
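The abstract gives no implementation details, so the following is only a minimal, illustrative PyTorch sketch of how a multimodal prompt encoder of this general kind could work: audio and video features are projected into the language model's embedding space, and learnable queries attend over them to produce a fixed set of soft prompt vectors that are prepended to the token embeddings. The module name, feature dimensions, number of prompts, and cross-attention design here are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative sketch only; not the authors' released code.
import torch
import torch.nn as nn


class MultimodalPromptEncoder(nn.Module):
    def __init__(self, audio_dim=74, video_dim=35, lm_dim=768, n_prompts=8):
        super().__init__()
        # Project each non-text modality into the language model's embedding space.
        self.audio_proj = nn.Linear(audio_dim, lm_dim)
        self.video_proj = nn.Linear(video_dim, lm_dim)
        # Learnable queries attend over the fused audio/video sequence
        # and yield a fixed number of soft prompt vectors.
        self.prompt_queries = nn.Parameter(torch.randn(n_prompts, lm_dim))
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T_a, audio_dim), video_feats: (B, T_v, video_dim)
        fused = torch.cat(
            [self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=1
        )  # (B, T_a + T_v, lm_dim)
        queries = self.prompt_queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        prompts, _ = self.cross_attn(queries, fused, fused)  # (B, n_prompts, lm_dim)
        return prompts


if __name__ == "__main__":
    encoder = MultimodalPromptEncoder()
    audio = torch.randn(2, 50, 74)   # placeholder acoustic features
    video = torch.randn(2, 50, 35)   # placeholder visual features
    prompts = encoder(audio, video)
    token_embeds = torch.randn(2, 32, 768)  # stand-in for pretrained LM token embeddings
    lm_input = torch.cat([prompts, token_embeds], dim=1)  # (2, 8 + 32, 768)
    print(lm_input.shape)
```

In such a setup, the multimodal prompts would be consumed by the language model exactly like extra input embeddings, so the backbone itself need not be modified; whether the paper trains these prompts jointly with alignment pre-training or separately is not specified in the abstract.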