Multimodal prompting and masking strategy for video-grounded dialogue
Feifei Xu, Wang Zhou, Fumiaoyue Jia
Knowledge-Based Systems, Volume 329, Article 114367 (published 2025-09-08)
DOI: 10.1016/j.knosys.2025.114367
URL: https://www.sciencedirect.com/science/article/pii/S0950705125014066
Citations: 0
Abstract
Video-Grounded Dialogue (VGD) is a challenging vision-language task that aims to conduct multi-turn dialogues with humans based on video and audio content. Although significant progress has been made in improving AI-generated responses, several challenges remain: 1) training demands a substantial amount of computing resources and time; 2) current dominant approaches, which use T5 or GPT-2 as base models, exhibit a limited ability to understand video and audio features because of their text-based pre-training paradigms; 3) existing studies have not addressed the robustness of models in real-world scenarios where dialogue history is often missing. To address these issues, we propose VPM, a Video-Grounded Dialogue framework that employs prompt-based tuning and a masking strategy. First, to reduce computational cost, and inspired by prompt learning, we are the first to apply prompt-based tuning to the Video-Grounded Dialogue task, using only 20% of the training set while maintaining comparable accuracy. Second, to enhance the model's understanding of video and audio, we propose a slicing-based visual mapping network that sequentially integrates learnable visual prompts and video-audio slice features through a series of operations. Finally, we put forward an exponential masking strategy for dialogue history to improve cross-modal understanding and robustness. Extensive experiments validate the effectiveness of the proposed framework, which achieves state-of-the-art performance on the AVSD@DSTC7 and AVSD@DSTC8 datasets.
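The abstract does not spell out how the exponential masking of dialogue history works. One plausible reading, sketched below purely as an illustration (the function name, `max_rate`, and `decay` are hypothetical, not taken from the paper), is that each history turn is masked with a probability that decays exponentially with recency, so older turns are dropped more often, mimicking real-world settings where early history is missing:

```python
import numpy as np

def exponential_history_mask(num_turns, max_rate=0.8, decay=0.5, rng=None):
    """Sample a per-turn boolean mask for the dialogue history.

    Hypothetical reading of an "exponential masking strategy": the
    probability of masking turn i (index 0 = oldest) decays
    exponentially with i, so early turns are removed more often.
    `max_rate` and `decay` are illustrative hyperparameters.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.arange(num_turns)               # 0 = oldest ... num_turns-1 = latest
    probs = max_rate * np.exp(-decay * idx)  # exponentially decaying mask rate
    mask = rng.random(num_turns) < probs     # True = this turn is masked out
    return mask, probs
```

For a 6-turn dialogue with these defaults, the oldest turn is masked with probability 0.8 while the most recent turn is almost always kept, which would train the model to answer even when early history is unavailable.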
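The "slicing-based visual mapping network" is likewise only named, not specified. A minimal sketch under stated assumptions (the function and its interleaving layout are hypothetical): split the frame-level feature sequence along time into as many slices as there are learnable visual prompts, mean-pool each slice, and place each prompt vector directly before its pooled slice in the sequence handed to the language model:

```python
import numpy as np

def slice_and_prompt(features, prompts):
    """Pool frame features into slices and interleave visual prompts.

    Assumptions (not stated in the abstract): `features` is a (T, D)
    frame sequence, split along time into len(prompts) slices; each
    slice is mean-pooled; prompt_k precedes pooled slice_k in the
    output sequence of shape (2 * num_slices, D).
    """
    num_slices, dim = prompts.shape
    chunks = np.array_split(features, num_slices, axis=0)
    pooled = np.stack([c.mean(axis=0) for c in chunks])  # (num_slices, D)
    out = np.empty((2 * num_slices, dim), dtype=features.dtype)
    out[0::2] = prompts  # prompt_k at even positions
    out[1::2] = pooled   # pooled slice_k right after its prompt
    return out
```

For example, 32 frames of 16-dimensional features with 4 prompts yield an 8-token sequence, alternating prompt and pooled slice. In the real model the prompts would be trainable parameters updated by backpropagation; NumPy arrays stand in for them here.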
Journal description:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.