MLLM as video narrator: Mitigating modality imbalance in video moment retrieval

IF 7.5 · CAS Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Weitong Cai , Jiabo Huang , Shaogang Gong , Hailin Jin , Yang Liu
{"title":"MLLM作为视频解说员:缓解视频时刻检索中的模态失衡","authors":"Weitong Cai ,&nbsp;Jiabo Huang ,&nbsp;Shaogang Gong ,&nbsp;Hailin Jin ,&nbsp;Yang Liu","doi":"10.1016/j.patcog.2025.111670","DOIUrl":null,"url":null,"abstract":"<div><div>Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, <em>i.e.</em>, the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we propose a novel MLLM-driven framework Text-Enhanced Alignment (TEA), to address the modality imbalance problem by enhancing the correlated visual-textual knowledge. TEA takes an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"166 ","pages":"Article 111670"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MLLM as video narrator: Mitigating modality imbalance in video moment retrieval\",\"authors\":\"Weitong Cai ,&nbsp;Jiabo Huang ,&nbsp;Shaogang Gong ,&nbsp;Hailin Jin ,&nbsp;Yang Liu\",\"doi\":\"10.1016/j.patcog.2025.111670\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, <em>i.e.</em>, the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. 
By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we propose a novel MLLM-driven framework Text-Enhanced Alignment (TEA), to address the modality imbalance problem by enhancing the correlated visual-textual knowledge. TEA takes an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"166 \",\"pages\":\"Article 111670\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325003309\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325003309","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text. It confines cross-modal alignment knowledge to a limited text corpus, leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLMs), in this work we propose a novel MLLM-driven framework, Text-Enhanced Alignment (TEA), to address the modality imbalance problem by enriching the correlated visual-textual knowledge. TEA uses an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting temporal localization. To effectively maintain temporal sensitivity for localization, we obtain text narratives for specific video timestamps and construct a structured text paragraph with time information that is temporally aligned with the visual content. We then perform cross-modal feature merging between the temporally aware narratives and the corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextually cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.
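The abstract outlines a three-step pipeline: per-timestamp MLLM narratives assembled into a time-annotated paragraph, cross-modal merging of narrative and video features, and a uni-modal narrative-query matching signal. The PyTorch sketch below illustrates one plausible reading of that pipeline; the module layout, tensor shapes, the `build_timed_paragraph` format, and the attention-based merging are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a TEA-style pipeline as described in the abstract.
# All names, dimensions, and the narrative format are illustrative assumptions.
import torch
import torch.nn as nn


def build_timed_paragraph(narratives):
    """Concatenate per-timestamp MLLM narratives into one structured,
    time-annotated paragraph (hypothetical format)."""
    return " ".join(f"[{t}s] {text}" for t, text in narratives)


class TextEnhancedAlignment(nn.Module):
    """Fuses temporally aligned narrative features into video features and
    scores query agreement (illustrative, not the paper's code)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-modal feature merging: video clips attend to their narratives.
        self.merge = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats, narrative_feats, query_feat):
        # video_feats:     (B, T, D) per-clip visual features
        # narrative_feats: (B, T, D) per-timestamp narrative text features
        # query_feat:      (B, D)    sentence-level query feature
        fused, _ = self.merge(video_feats, narrative_feats, narrative_feats)
        enhanced = self.norm(video_feats + fused)  # semantic-enhanced video sequence

        # Per-clip localization scores from the enhanced video representation.
        loc_scores = torch.einsum("btd,bd->bt", enhanced, query_feat)

        # Uni-modal narrative-query matching: text-only agreement used as a
        # complementary signal during training.
        match_scores = torch.einsum("btd,bd->bt", narrative_feats, query_feat)
        return loc_scores, match_scores


if __name__ == "__main__":
    print(build_timed_paragraph([(0, "a man opens a fridge"),
                                 (5, "he pours milk into a glass")]))
    model = TextEnhancedAlignment()
    v, n, q = torch.randn(2, 8, 256), torch.randn(2, 8, 256), torch.randn(2, 256)
    loc, match = model(v, n, q)
    print(loc.shape, match.shape)  # torch.Size([2, 8]) twice
```

In this reading, the localization and matching scores would be supervised jointly, so the text-only branch can inject complementary narrative context without altering the visual backbone.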
Source journal
Pattern Recognition (Engineering & Technology — Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 16.20%
Articles per year: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.