TC-MGC: Text-conditioned multi-grained contrastive learning for text–video retrieval

IF 14.7 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion Pub Date : 2025-04-05 DOI:10.1016/j.inffus.2025.103151

Xiaolun Jing, Genke Yang, Jian Chu

Information Fusion, Volume 121, Article 103151. Full text: https://www.sciencedirect.com/science/article/pii/S1566253525002246

Citations: 0

Abstract

Motivated by the success of coarse-grained or fine-grained contrast in text–video retrieval, multi-grained contrastive learning methods have emerged that integrate contrasts at different granularities. However, because videos span a wider semantic range than texts, text-agnostic video representations may encode misleading information that the text does not describe, which hinders the model from capturing precise cross-modal semantic correspondence. To this end, we propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Specifically, our model employs a language–video attention block to generate aggregated frame and video representations conditioned on the attention weights of each word and of the whole text over the frames. To filter unnecessary similarity interactions and reduce the trainable parameters in the Interactive Similarity Aggregation (ISA) module, we design a Similarity Reorganization (SR) module that identifies attentive similarities and reorganizes the cross-modal similarity vectors and matrices. Next, we argue that imbalance among multi-grained similarities may cause some similarities to be over-represented and others under-represented. We therefore introduce an auxiliary Similarity Decorrelation Regularization (SDR) loss that minimizes the variance of the similarities on matching text–video pairs, encouraging the similarities to be used cooperatively. Finally, we present a Linear Softmax Aggregation (LSA) module to explicitly encourage interactions between the multiple similarities and promote the use of multi-grained information. Empirically, TC-MGC achieves competitive results on multiple text–video retrieval benchmarks, outperforming the X-CLIP model by +2.8% (+1.3%), +2.2% (+1.0%), and +1.5% (+0.9%) relative (absolute) in text-to-video retrieval R@1 on MSR-VTT, DiDeMo, and VATEX, respectively. Our code is publicly available at https://github.com/JingXiaolun/TC-MGC.
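The abstract gives no formulas, so the following is only a rough sketch of three ideas it names: pooling frames with text-derived attention weights, penalizing the variance of several similarity scores for a matching pair, and combining those scores with softmax-normalized weights. It is not the authors' implementation. Plain dot-product attention, the population-variance formula, and all function and parameter names (`text_conditioned_video`, `similarity_variance`, `softmax_aggregate`, `temperature`, `logits`) are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_conditioned_video(text_vec, frame_vecs, temperature=1.0):
    """Pool frame embeddings using weights from text-frame attention.

    Assumption: attention scores are plain dot products; the paper's
    language-video attention block may differ.
    """
    scores = [dot(text_vec, f) / temperature for f in frame_vecs]
    weights = softmax(scores)
    dim = len(frame_vecs[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_vecs)) for d in range(dim)
    ]
    return pooled, weights


def similarity_variance(sims):
    """Population variance of several similarity scores for one matching pair.

    Assumption: a variance penalty of this shape stands in for the SDR loss.
    """
    mean = sum(sims) / len(sims)
    return sum((s - mean) ** 2 for s in sims) / len(sims)


def softmax_aggregate(sims, logits):
    """Weighted sum of similarities with softmax-normalized weights.

    Assumption: in a real model `logits` would be learned parameters.
    """
    weights = softmax(logits)
    return sum(w * s for w, s in zip(weights, sims))


# Toy data: a 2-d text embedding and three frame embeddings (hypothetical values).
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
video, attn = text_conditioned_video(text, frames)
print("attention weights over frames:", attn)
print("text-conditioned video embedding:", video)

# Three hypothetical similarity scores at different granularities.
sims = [0.62, 0.55, 0.70]
print("variance penalty:", similarity_variance(sims))
print("aggregated similarity:", softmax_aggregate(sims, [0.0, 0.0, 0.0]))
```

In this toy setup the frame most aligned with the text receives the largest weight, so the pooled video embedding leans toward content the text actually describes. The variance term is zero only when all similarity scores agree.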


Source journal

Information Fusion | Engineering & Technology, Computer Science: Theory & Methods

CiteScore

33.20

Self-citation rate

4.30%

Articles published

161

Review time

7.9 months

Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.