Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.

IF 8.9 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE transactions on neural networks and learning systems Pub Date : 2025-09-09 DOI:10.1109/tnnls.2025.3605657

Leqi Shen,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding

{"title":"Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.","authors":"Leqi Shen,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding","doi":"10.1109/tnnls.2025.3605657","DOIUrl":null,"url":null,"abstract":"Temporal modeling plays an important role in the effective adaption of the powerful pretrained text-image foundation model into text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformer or BiLSTM, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision-language foundation models (TFVL) to model the temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as \"visual words\" within a temporal \"sentence,\" capturing temporal dependencies. On the other hand, ImageTemp uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those in other existing methods. We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"15 1","pages":""},"PeriodicalIF":8.9000,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3605657","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Temporal modeling plays an important role in the effective adaption of the powerful pretrained text-image foundation model into text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformer or BiLSTM, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision-language foundation models (TFVL) to model the temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as "visual words" within a temporal "sentence," capturing temporal dependencies. On the other hand, ImageTemp uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those in other existing methods. We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.

查看原文本刊更多论文

基于冻结视觉语言基础模型的时间建模用于参数高效文本视频检索。

时间建模在将强大的预训练文本-图像基础模型有效地应用于文本-视频检索中起着重要的作用。然而，现有的方法往往依赖于额外的重型可训练模块，如变压器或BiLSTM，这是低效的。相反，我们通过利用冻结的基础模型来避免引入如此沉重的组件。为此，我们提出了使用冻结视觉语言基础模型（TFVL）进行时间建模，以模拟固定编码器的时间动态。具体来说，文本编码器时间建模（TextTemp）和图像编码器时间建模（ImageTemp）分别在视频头部和视频骨干中应用冻结的文本和图像编码器。TextTemp使用冻结文本编码器将帧表示解释为时间“句子”中的“视觉单词”，从而捕获时间依赖性。另一方面，ImageTemp使用冻结图像编码器将所有帧令牌视为统一的视觉实体，学习时空信息。我们的方法的总可训练参数，包括一个轻量级的投影和几个提示符号，明显少于其他现有方法。我们评估了我们的方法在MSR-VTT、DiDeMo、ActivityNet和LSMDC上的有效性。与MSR-VTT的完全微调相比，我们的TFVL在R@1中仅使用0.35%的参数即可实现平均3.25%的增益。大量的实验表明，所提出的TFVL在参数显著减少的情况下优于最先进的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE transactions on neural networks and learning systems COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

CiteScore

23.80

自引率

9.60%

发文量

2102

审稿时长

3-8 weeks

期刊介绍： The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.