Multimodal prompting and masking strategy for video-grounded dialogue

IF 7.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Feifei Xu, Wang Zhou, Fumiaoyue Jia
{"title":"Multimodal prompting and masking strategy for video-grounded dialogue","authors":"Feifei Xu ,&nbsp;Wang Zhou ,&nbsp;Fumiaoyue Jia","doi":"10.1016/j.knosys.2025.114367","DOIUrl":null,"url":null,"abstract":"<div><div>Video-Grounded Dialogue (VGD) is a challenging vision-language task aimed at engaging in multi-turn dialogues with humans based on video and audio content. Despite significant progress in improving AI-generated responses has been made, several challenges remain: 1) A significant amount of computing resources and time are required during training; 2) Current dominant approaches, utilizing T5 or GPT2 as base models, exhibit limited ability to understand video and audio features due to their text-based pre-training paradigms; 3) Existing studies have not addressed the robustness of models in real-world scenarios where dialog history is often missing. To address these issues, we propose VPM, a Video-Grounded Dialogue framework employing prompt-based tuning and a masking strategy. Firstly, to reduce computation resources, inspired by prompt learning, we are the first to employ prompt-based tuning in Video-Grounded Dialogue task by using only 20 % of the training set while maintaining proximal accuracy. Secondly, to enhance the model’s understanding of video and audio, we propose a slicing-based visual mapping network, integrating learnable visual prompts and video-audio slice features sequentially through a series of operations. Finally, we put forward an exponentially masking strategy for dialogue history to improve cross-modal understanding and robustness. Extensive experiments validate the effectiveness of our proposed framework, achieving state-of-the-art performance on the AVSD@DSTC7 and AVSD@DSTC8 datasets.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"329 ","pages":"Article 114367"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125014066","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Video-Grounded Dialogue (VGD) is a challenging vision-language task aimed at engaging in multi-turn dialogues with humans based on video and audio content. Although significant progress has been made in improving AI-generated responses, several challenges remain: 1) a significant amount of computing resources and time is required during training; 2) current dominant approaches, which use T5 or GPT-2 as base models, exhibit limited ability to understand video and audio features due to their text-based pre-training paradigms; 3) existing studies have not addressed the robustness of models in real-world scenarios where dialogue history is often missing. To address these issues, we propose VPM, a Video-Grounded Dialogue framework employing prompt-based tuning and a masking strategy. First, to reduce computational cost, inspired by prompt learning, we are the first to employ prompt-based tuning in the Video-Grounded Dialogue task, using only 20% of the training set while maintaining comparable accuracy. Second, to enhance the model's understanding of video and audio, we propose a slicing-based visual mapping network that sequentially integrates learnable visual prompts and video-audio slice features through a series of operations. Finally, we put forward an exponential masking strategy for dialogue history to improve cross-modal understanding and robustness. Extensive experiments validate the effectiveness of our proposed framework, which achieves state-of-the-art performance on the AVSD@DSTC7 and AVSD@DSTC8 datasets.
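The abstract does not spell out the masking mechanism, but the idea of an exponential masking strategy over dialogue history can be illustrated with a minimal sketch. The sketch below assumes turn-level masking with a probability that grows exponentially with a turn's distance from the current question; the function name `exponential_mask_history`, the `base_rate` and `gamma` parameters, and the `[MASK]` placeholder are illustrative assumptions, not the paper's actual implementation.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the paper's choice of mask token may differ


def exponential_mask_history(history_turns, base_rate=0.15, gamma=1.5, seed=None):
    """Mask dialogue-history turns with a probability that grows exponentially
    the further a turn lies from the current question.

    history_turns : list[str] -- oldest turn first, newest turn last
    base_rate     : float     -- masking probability for the most recent turn (assumed)
    gamma         : float     -- exponential growth factor per turn of distance (assumed)

    Turn-level granularity is itself an assumption; the published method may
    mask at the token level instead.
    """
    rng = random.Random(seed)
    n = len(history_turns)
    masked = []
    for i, turn in enumerate(history_turns):
        distance = n - 1 - i                      # 0 for the newest turn
        p = min(1.0, base_rate * (gamma ** distance))
        masked.append(MASK_TOKEN if rng.random() < p else turn)
    return masked


if __name__ == "__main__":
    history = [
        "Q1: What is the man doing?", "A1: He is cooking.",
        "Q2: Is there any sound?",    "A2: Yes, a kettle is whistling.",
    ]
    print(exponential_mask_history(history, seed=0))
```

Masking older turns more aggressively during training would push the model to ground its answers in the video and audio streams rather than in textual history, which is consistent with the robustness motivation stated in the abstract.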
Source Journal

Knowledge-Based Systems (Engineering & Technology: Computer Science, Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles per year: 1245
Review turnaround: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.