Efficient Cross-Modal Video Retrieval With Meta-Optimized Frames
Authors: Ning Han; Xun Yang; Ee-Peng Lim; Hao Chen; Qianru Sun
Journal: IEEE Transactions on Multimedia, vol. 26, pp. 10924-10936
Publication date: 2024-06-28
Publication type: Journal Article
DOI: 10.1109/TMM.2024.3416669
URL: https://ieeexplore.ieee.org/document/10576688/
Impact factor: 8.4 (JCR Q1, Computer Science, Information Systems)
Citations: 0
Abstract
Cross-modal video retrieval aims to retrieve semantically relevant videos given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods rely on Vision Transformer (ViT) to extract video features (Lei et al., 2021), (Bain et al., 2021), (Wang et al., 2022). However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number of frames (e.g., 4 or 8) from the target video, instead of using the whole video, as ViT inputs. The number of frames strongly influences the performance of ViT: using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To escape this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods and can achieve new state-of-the-art performance.
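The bilevel idea in the abstract can be illustrated with a deliberately tiny toy. It is a sketch under strong assumptions and not the authors' implementation. Each frame is reduced to a single scalar feature, the "retrieval model" is one weight `w`, and the video embedding is `w * mean(frames)`, matched to a scalar text target by squared error. The model-level update is a single gradient step on the compressed frames, and the frame-level update differentiates the whole-video loss of that updated model with respect to the frames. Initializing the compressed frames by uniform sampling is also an assumption. All function names and hyperparameters are hypothetical.

```python
"""Toy bilevel sketch: meta-optimize a few 'compressed frames' (all names hypothetical)."""


def uniform_sample_indices(num_frames, k):
    """Return k evenly spread frame indices (midpoint of each of k equal segments)."""
    if not 0 < k <= num_frames:
        raise ValueError("need 0 < k <= num_frames")
    return [(2 * i + 1) * num_frames // (2 * k) for i in range(k)]


def mean(values):
    return sum(values) / len(values)


def loss(w, frames, target):
    """Squared error between the toy video embedding w * mean(frames) and the text target."""
    return (w * mean(frames) - target) ** 2


def inner_step(w, frames, target, lr):
    """Model-level (base) update: one gradient step on w using only the given frames."""
    m = mean(frames)
    return w - lr * 2.0 * (w * m - target) * m


def meta_loss(w, frames, whole_video, target, lr):
    """Loss of the one-step-updated model, evaluated on the whole video."""
    return loss(inner_step(w, frames, target, lr), whole_video, target)


def meta_grad(w, frames, whole_video, target, lr):
    """Analytic gradient of meta_loss with respect to each compressed frame."""
    m, m_whole = mean(frames), mean(whole_video)
    w_new = inner_step(w, frames, target, lr)
    d_loss_d_w_new = 2.0 * (w_new * m_whole - target) * m_whole
    d_w_new_d_m = -2.0 * lr * (2.0 * w * m - target)
    # mean(frames) depends on each frame with weight 1 / len(frames).
    g = d_loss_d_w_new * d_w_new_d_m / len(frames)
    return [g] * len(frames)


def train(whole_video, target, k, w=0.1, steps=200, lr=0.01, meta_lr=1.0):
    """Alternate frame-level (meta) and model-level (base) gradient steps."""
    # Assumption: compressed frames start from a uniform sample of the video.
    frames = [whole_video[i] for i in uniform_sample_indices(len(whole_video), k)]
    for _ in range(steps):
        grads = meta_grad(w, frames, whole_video, target, lr)
        frames = [f - meta_lr * g for f, g in zip(frames, grads)]
        w = inner_step(w, frames, target, lr)
    return w, frames


if __name__ == "__main__":
    video = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    w, frames = train(video, target=2.0, k=4)
    print("learned weight:", w)
    print("meta-optimized frames:", frames)
    print("loss on whole video:", loss(w, video, 2.0))
```

The point of the toy is only the structure: the compressed frames are treated as learnable parameters, the model never sees more than `k` frames in its own update, and the whole video enters solely through the meta loss. A real system would presumably replace the scalar model with a ViT-based retrieval model and compute the meta gradient by automatic differentiation rather than by hand.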
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.