{"title":"Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders","authors":"Zuozhuo Dai , Kaihui Cheng , Fangtao Shao , Zilong Dong , Siyu Zhu","doi":"10.1016/j.patcog.2024.111099","DOIUrl":null,"url":null,"abstract":"<div><div>State-of-the-art methods for text–video retrieval generally leverage CLIP embeddings and cosine similarity for efficient retrieval. Meanwhile, recent advancements in cross-attention techniques introduce transformer decoders to facilitate attention computation between text queries and visual tokens extracted from video frames, enabling a more comprehensive interaction between textual and visual information. In this study, we combine the advantages of both approaches and propose a fine-grained re-ranking approach incorporating a multi-grained text–video cross attention module. Specifically, the re-ranker enhances the top K similar candidates identified by the cosine similarity network. To explore video and text interactions efficiently, we introduce frame and video token selectors to obtain salient visual tokens at both frame and video levels. Then, a multi-grained cross-attention mechanism is applied between text and visual tokens at these levels to capture multimodal information. To reduce the training overhead associated with the multi-grained cross-attention module, we freeze the vision backbone and only train the multi-grained cross attention module. This frozen strategy allows for scalability to larger pre-trained vision models such as ViT-G, leading to enhanced retrieval performance. Experimental evaluations on text–video retrieval datasets showcase the effectiveness and scalability of our proposed re-ranker combined with existing state-of-the-art methodologies.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111099"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008501","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
State-of-the-art methods for text–video retrieval generally leverage CLIP embeddings and cosine similarity for efficient retrieval. Meanwhile, recent advancements in cross-attention techniques introduce transformer decoders to facilitate attention computation between text queries and visual tokens extracted from video frames, enabling a more comprehensive interaction between textual and visual information. In this study, we combine the advantages of both approaches and propose a fine-grained re-ranking approach incorporating a multi-grained text–video cross attention module. Specifically, the re-ranker enhances the top K similar candidates identified by the cosine similarity network. To explore video and text interactions efficiently, we introduce frame and video token selectors to obtain salient visual tokens at both frame and video levels. Then, a multi-grained cross-attention mechanism is applied between text and visual tokens at these levels to capture multimodal information. To reduce the training overhead associated with the multi-grained cross-attention module, we freeze the vision backbone and only train the multi-grained cross attention module. This frozen strategy allows for scalability to larger pre-trained vision models such as ViT-G, leading to enhanced retrieval performance. Experimental evaluations on text–video retrieval datasets showcase the effectiveness and scalability of our proposed re-ranker combined with existing state-of-the-art methodologies.
期刊介绍:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.