VLAB: Enhancing Video Language Pretraining by Feature Adapting and Blending

Impact Factor: 8.4 · CAS Tier 1 (Computer Science) · JCR Q1 (Computer Science, Information Systems)
Xingjian He;Sihan Chen;Fan Ma;Zhicheng Huang;Xiaojie Jin;Zikang Liu;Dongmei Fu;Yi Yang;Jing Liu;Jiashi Feng
{"title":"VLAB: Enhancing Video Language Pretraining by Feature Adapting and Blending","authors":"Xingjian He;Sihan Chen;Fan Ma;Zhicheng Huang;Xiaojie Jin;Zikang Liu;Dongmei Fu;Yi Yang;Jing Liu;Jiashi Feng","doi":"10.1109/TMM.2024.3521729","DOIUrl":null,"url":null,"abstract":"Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: <bold>V</b>ideo <bold>L</b>anguage pre-training by feature <bold>A</b>dapting and <bold>B</b>lending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 60.9, and 79.0, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2168-2180"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814098/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Large-scale image-text contrastive pre-training models such as CLIP have been shown to learn high-quality multimodal representations effectively. However, there has been limited research on building video-text representations for general video multimodal tasks on top of these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and to extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video-text retrieval, video captioning, and video question answering. Remarkably, VLAB significantly outperforms competing methods and sets new records in video question answering on the MSRVTT, MSVD, and TGIF datasets, achieving accuracies of 49.6, 60.9, and 79.0, respectively.
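The abstract names the two strategies but does not reproduce the paper's architecture. The following is a minimal, hypothetical PyTorch sketch of the general ideas: a lightweight temporal adapter applied on top of frozen per-frame CLIP features (feature adapting) and a simple fusion of image-level and video-level embeddings (feature blending). The names `TemporalAdapter` and `blend_features`, the bottleneck size, and the mixing weight are illustrative assumptions, not VLAB's released implementation.

```python
# Hypothetical sketch of the two strategies described in the abstract.
# Not the authors' code; module names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Adds temporal modeling on top of frozen per-frame CLIP features
    (a sketch of the "feature adapting" idea)."""

    def __init__(self, dim: int, num_heads: int = 8, bottleneck: int = 256):
        super().__init__()
        # Self-attention across frames so frame embeddings can exchange
        # temporal information.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Bottleneck MLP, as in standard adapter designs.
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame CLIP embeddings.
        x = self.norm(frame_feats)
        attn_out, _ = self.temporal_attn(x, x, x)
        residual = frame_feats + attn_out
        return residual + self.up(self.act(self.down(residual)))


def blend_features(image_feats: torch.Tensor,
                   video_feats: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Toy version of "feature blending": mix complementary image-level and
    temporally adapted video-level representations into one embedding."""
    pooled_image = image_feats.mean(dim=1)  # (batch, dim)
    pooled_video = video_feats.mean(dim=1)  # (batch, dim)
    return alpha * pooled_video + (1.0 - alpha) * pooled_image


if __name__ == "__main__":
    batch, num_frames, dim = 2, 8, 512          # dim matches CLIP ViT-B/32
    frame_feats = torch.randn(batch, num_frames, dim)  # stand-in for CLIP outputs
    adapter = TemporalAdapter(dim)
    video_feats = adapter(frame_feats)
    fused = blend_features(frame_feats, video_feats)
    print(fused.shape)  # torch.Size([2, 512])
```

In this sketch the CLIP backbone is assumed frozen and only the adapter is trained, which is the usual motivation for adapter-style transfer; the paper itself should be consulted for how VLAB actually structures and trains these components.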
Source Journal
IEEE Transactions on Multimedia
Category: Engineering & Technology - Telecommunications
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Average review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsoring societies, ensuring comprehensive coverage of multimedia research.