Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames

IF 8.4 · CAS Region 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Ning Han;Xun Yang;Ee-Peng Lim;Hao Chen;Qianru Sun
DOI: 10.1109/TMM.2024.3416669
Journal: IEEE Transactions on Multimedia, vol. 26, pp. 10924–10936
Published: 2024-06-28
URL: https://ieeexplore.ieee.org/document/10576688/
Citations: 0

Abstract

Cross-modal video retrieval aims to retrieve semantically relevant videos given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage the Vision Transformer (ViT) to extract video features (Lei et al., 2021), (Bain et al., 2021), (Wang et al., 2022). However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames strongly influences the performance of ViT: using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To get free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the "compressed frames" learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the "compressed frames") the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames at inference time. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods and achieves new state-of-the-art performance.
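The frame-level (meta-level) step described above can be illustrated with a toy sketch: the "compressed frames" are treated as learnable tensors and updated by gradient descent on a meta loss computed against the whole video. Note that everything below is a simplification for illustration only — the mean-pooling "model", the squared-error meta loss, and the learning rate are assumptions, not the paper's actual retrieval model or loss.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, d = 32, 4, 8                 # whole-video frames, compressed frames, feature dim
video = rng.normal(size=(T, d))    # stand-in for the full frame sequence

def embed(frames):
    # Toy "model": mean-pool frames into a single clip embedding.
    return frames.mean(axis=0)

# Meta supervision comes from the whole video (available at training time only).
target = embed(video)

# Initialize the compressed frames by uniform sampling, as in the common baseline.
F = video[np.linspace(0, T - 1, k, dtype=int)].copy()

lr = 0.5
for step in range(200):
    # Meta loss: distance between the compressed-frame embedding and the
    # whole-video embedding.
    diff = embed(F) - target
    # Analytic gradient of 0.5 * ||diff||^2 with respect to each frame in F;
    # in the real method this gradient flows through the retrieval model.
    grad = np.tile(diff / k, (k, 1))
    F -= lr * grad

final_loss = 0.5 * np.sum((embed(F) - target) ** 2)
print(final_loss)
```

After optimization, the k learned frames carry (in this toy case) the same pooled signal as all T original frames, mirroring how MOF lets a few input frames stand in for the whole video at inference time.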
Source Journal

IEEE Transactions on Multimedia (Engineering Technology – Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Annual articles: 576
Review time: 5.5 months
Journal scope: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.