{"title":"基于冻结视觉语言基础模型的时间建模用于参数高效文本视频检索。","authors":"Leqi Shen,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding","doi":"10.1109/tnnls.2025.3605657","DOIUrl":null,"url":null,"abstract":"Temporal modeling plays an important role in the effective adaption of the powerful pretrained text-image foundation model into text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformer or BiLSTM, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision-language foundation models (TFVL) to model the temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as \"visual words\" within a temporal \"sentence,\" capturing temporal dependencies. On the other hand, ImageTemp uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those in other existing methods. We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. 
Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"15 1","pages":""},"PeriodicalIF":8.9000,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.\",\"authors\":\"Leqi Shen,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding\",\"doi\":\"10.1109/tnnls.2025.3605657\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Temporal modeling plays an important role in the effective adaption of the powerful pretrained text-image foundation model into text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformer or BiLSTM, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision-language foundation models (TFVL) to model the temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as \\\"visual words\\\" within a temporal \\\"sentence,\\\" capturing temporal dependencies. On the other hand, ImageTemp uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those in other existing methods. 
We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2025-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tnnls.2025.3605657\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3605657","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval.
Temporal modeling plays an important role in effectively adapting powerful pretrained text-image foundation models to text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformers or BiLSTMs, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision-language foundation models (TFVL), which models temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as "visual words" within a temporal "sentence," capturing temporal dependencies. ImageTemp, in turn, uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those of existing methods. We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.
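To make the parameter-efficiency idea concrete, the following is a minimal, illustrative NumPy sketch (not the paper's implementation) of the TextTemp-style design: frame features are passed through a small trainable projection, joined with a few trainable prompt tokens, and aggregated by a frozen encoder, so that only the projection and prompts contribute trainable parameters. All dimensions, the pooling choice, and the linear-map stand-in for the frozen encoder are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
D = 512          # feature dimension of the frozen encoders
NUM_FRAMES = 12  # frames sampled per video
NUM_PROMPTS = 4  # learnable prompt tokens prepended to the frame sequence

# Stand-in for a frozen encoder: a fixed linear map that is never updated.
W_frozen = rng.standard_normal((D, D)) / np.sqrt(D)

# The only trainable pieces, per the abstract: a lightweight projection
# and a handful of prompt tokens.
W_proj = rng.standard_normal((D, D)) / np.sqrt(D)   # trainable projection
prompts = rng.standard_normal((NUM_PROMPTS, D))     # trainable prompt tokens

def text_temp(frame_feats):
    """Illustrative TextTemp-style aggregation: project frame features
    into "visual words", prepend prompt tokens to form a temporal
    "sentence", run it through the frozen encoder, and mean-pool the
    result into a single video embedding."""
    projected = frame_feats @ W_proj                 # frames -> "visual words"
    sequence = np.concatenate([prompts, projected])  # temporal "sentence"
    encoded = sequence @ W_frozen                    # frozen encoder pass
    return encoded.mean(axis=0)                      # pooled video embedding

frames = rng.standard_normal((NUM_FRAMES, D))
video_emb = text_temp(frames)

# Only W_proj and the prompts would receive gradients during training;
# W_frozen stays fixed, which is where the parameter savings come from.
trainable = W_proj.size + prompts.size
frozen = W_frozen.size
```

In this toy setup the trainable parameters are `D*D + NUM_PROMPTS*D`; in the actual method the frozen encoder is a full pretrained transformer, so the trainable fraction is far smaller (the paper reports 0.35% of full fine-tuning).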
Journal introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.