{"title":"MEME: Multi-Encoder Multi-Expert Framework with Data Augmentation for Video Retrieval","authors":"Seong-Min Kang, Yoon-Sik Cho","doi":"10.1145/3539618.3591726","DOIUrl":null,"url":null,"abstract":"Text-to-video(T2V) retrieval aims to find relevant videos from text queries. The recently introduced Contrastive Language Image Pretraining (CLIP), a pretrained language-vision model trained on large-scale image and caption pairs, has been extensively studied in the literature for this task. Existing studies on T2V task have aimed to transfer the CLIP knowledge and focus on enhancing retrieval performance through fine-grained representation learning. While fine-grained contrast has achieved some remarkable results, less attention has been paid to coarse-grained contrasts. To this end, we propose a method called Graph Patch Spreading (GPS) to aggregate patches across frames at the coarse-grained level. We apply GPS to our proposed framework called Multi-Encoder Multi-Expert (MEME) framework. Our proposed scheme is general enough to be applied to any existing CLIP-based video-text retrieval models. We demonstrate the effectiveness of our method on existing models over the benchmark datasets MSR-VTT, MSVD, and LSMDC datasets. Our code can be found at https://github.com/kang7734/MEME__.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539618.3591726","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Text-to-video (T2V) retrieval aims to find relevant videos from text queries. The recently introduced Contrastive Language Image Pretraining (CLIP), a language-vision model pretrained on large-scale image-caption pairs, has been studied extensively for this task. Existing studies on the T2V task aim to transfer CLIP knowledge and focus on enhancing retrieval performance through fine-grained representation learning. While fine-grained contrast has achieved remarkable results, less attention has been paid to coarse-grained contrast. To this end, we propose a method called Graph Patch Spreading (GPS) to aggregate patches across frames at the coarse-grained level. We apply GPS within our proposed Multi-Encoder Multi-Expert (MEME) framework. Our scheme is general enough to be applied to any existing CLIP-based video-text retrieval model. We demonstrate the effectiveness of our method with existing models on the benchmark datasets MSR-VTT, MSVD, and LSMDC. Our code can be found at https://github.com/kang7734/MEME__.
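To make the coarse-grained aggregation idea concrete, the sketch below illustrates one plausible reading of "spreading patches across frames via a graph": build a similarity graph over the patch tokens of all sampled frames, propagate features along its strongest edges, and pool the result into a single video embedding. The abstract does not specify the actual GPS formulation, so the cosine-similarity adjacency, top-k sparsification, single propagation step, mean pooling, and the function/parameter names (spread_patches, top_k) are all illustrative assumptions, not the paper's method.

```python
# Illustrative PyTorch sketch of graph-based patch aggregation across frames,
# in the spirit of what the abstract calls Graph Patch Spreading (GPS).
# Every design choice here is an assumption made for illustration only.
import torch
import torch.nn.functional as F


def spread_patches(patch_feats: torch.Tensor, top_k: int = 8) -> torch.Tensor:
    """Aggregate per-frame patch features into one coarse video embedding.

    Args:
        patch_feats: (num_frames, num_patches, dim) patch tokens, e.g. from
            CLIP's vision transformer applied to each sampled frame.
        top_k: number of neighbours each patch spreads to (hypothetical
            hyper-parameter, not taken from the paper).

    Returns:
        (dim,) coarse-grained video representation.
    """
    f, p, d = patch_feats.shape
    tokens = F.normalize(patch_feats.reshape(f * p, d), dim=-1)

    # Dense cosine-similarity graph over the patches of all frames.
    sim = tokens @ tokens.t()                                  # (f*p, f*p)

    # Keep only the strongest edges so features spread along semantically
    # similar patches (including patches from other frames).
    topk_vals, topk_idx = sim.topk(top_k, dim=-1)
    adj = torch.zeros_like(sim).scatter_(-1, topk_idx, topk_vals)
    adj = adj / adj.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # row-normalise

    # One propagation ("spreading") step, then mean-pool to a single vector.
    spread = adj @ tokens                                      # (f*p, d)
    return spread.mean(dim=0)


if __name__ == "__main__":
    frames, patches, dim = 12, 49, 512     # e.g. 12 frames, 7x7 CLIP patches
    video_patches = torch.randn(frames, patches, dim)
    video_vec = spread_patches(video_patches)
    print(video_vec.shape)                 # torch.Size([512])
```

The resulting video vector could then be contrasted against CLIP text embeddings with the usual cosine-similarity objective; any expert-routing or multi-encoder logic from the MEME framework would sit on top of this and is not depicted here.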