MLLM as video narrator: Mitigating modality imbalance in video moment retrieval
Weitong Cai, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
Pattern Recognition, Volume 166, Article 111670
DOI: 10.1016/j.patcog.2025.111670
Published: 2025-04-11
Citations: 0
Abstract
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations: the query sentence typically matches only a fraction of the prominent foreground video content and exhibits limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text. It confines cross-modal alignment knowledge to a limited text corpus, leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLMs), in this work we propose a novel MLLM-driven framework, Text-Enhanced Alignment (TEA), which addresses the modality imbalance problem by enriching the correlated visual-textual knowledge. TEA employs an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and improving temporal localization. To effectively maintain temporal sensitivity for localization, we generate text narratives for each video timestamp and construct a structured text paragraph with time information that is temporally aligned with the visual content. We then perform cross-modal feature merging between the temporal-aware narratives and the corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextually cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.
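To make the described pipeline concrete, below is a minimal, hypothetical sketch (in PyTorch) of two steps the abstract outlines: assembling per-clip MLLM narrations into a time-stamped paragraph, and fusing temporally aligned narrative embeddings with the corresponding clip features. The function and module names, the fixed clip length, and the gated-sum fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' implementation) of two ideas from the abstract:
# (1) a structured, time-stamped narrative paragraph built from per-clip MLLM narrations,
# (2) per-timestamp fusion of narrative and video clip features.
# All names (build_timed_narrative, NarrativeVideoFusion) are hypothetical.

from typing import List
import torch
import torch.nn as nn


def build_timed_narrative(narrations: List[str], clip_len: float = 2.0) -> str:
    """Join per-clip MLLM narrations into one paragraph with explicit time markers."""
    parts = []
    for i, text in enumerate(narrations):
        start, end = i * clip_len, (i + 1) * clip_len
        parts.append(f"[{start:.0f}s-{end:.0f}s] {text.strip()}")
    return " ".join(parts)


class NarrativeVideoFusion(nn.Module):
    """Fuse temporally aligned narrative and video features via a simple gated sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, video_feats: torch.Tensor, narr_feats: torch.Tensor) -> torch.Tensor:
        # video_feats, narr_feats: (num_clips, dim), aligned clip-by-clip
        g = self.gate(torch.cat([video_feats, narr_feats], dim=-1))
        return g * video_feats + (1.0 - g) * narr_feats


if __name__ == "__main__":
    narrations = ["A man opens a fridge.", "He pours milk into a glass.", "He drinks the milk."]
    print(build_timed_narrative(narrations))

    fusion = NarrativeVideoFusion(dim=256)
    v = torch.randn(3, 256)   # per-clip video features
    t = torch.randn(3, 256)   # per-clip narrative embeddings
    enhanced = fusion(v, t)   # (3, 256) semantic-enhanced clip representations
    print(enhanced.shape)
```

The gated sum is only one plausible choice for the cross-modal merging step; the key point it illustrates is that narrative and video features are combined clip-by-clip so the enhanced representation stays temporally aligned for localization.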
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.