AutoVMR: An autonomous event generation and localization approach for video moment retrieval

IF 6.8 Zone 1 (Computer Science) 0 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Sciences Pub Date : 2025-08-22 DOI:10.1016/j.ins.2025.122615

Shu Luo, Qiwei Ma, Jiawei Wang, Da Cao, Shaofei Lu

Information Sciences, Volume 721, Article 122615. Journal Article. https://www.sciencedirect.com/science/article/pii/S0020025525007480

Citations: 0

Abstract

Video Moment Retrieval (VMR) aims to identify a semantically relevant segment within a video based on a descriptive language query, specifying the segment's boundaries through start and end timestamps. Despite recent advances, various VMR frameworks still rely heavily on extensive manual annotations, which are resource-intensive and do not scale to large video databases. In addition, although large language models have been applied to VMR, they still depend on sophisticated prompt design and multi-turn question answering, and so remain far from automated. To address these issues, we propose AutoVMR, a novel multimodal large language model framework that employs an autonomous event generation and localization approach for VMR. AutoVMR uses an autoregressive architecture that accepts video input and a fixed prompt template and generates event descriptions of video segments along with their corresponding start and end times. We also introduce a reward model based on Intersection over Union (IoU), trained using reinforcement learning from human feedback. This model is integrated into the Proximal Policy Optimization (PPO) training strategy and includes a query-time boundary generation mechanism to improve AutoVMR's performance. The reward model's modeling approach effectively filters out noise in the VMR dataset, enabling the PPO method to better comprehend video content and generate more accurate temporal localizations. Moreover, integrating the autoregressive process with PPO training allows the model to be trained on unannotated video data, leading to improved performance in a semi-supervised setting. Experimental results demonstrate that AutoVMR outperforms traditional VMR methods and the latest multimodal large language models, achieving state-of-the-art performance.
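The abstract builds its reward signal on Intersection over Union (IoU) between time segments. As a minimal sketch of that underlying quantity only, the snippet below computes the standard temporal IoU of a predicted segment against a reference segment. It is not the paper's learned reward model, and the names `Segment` and `temporal_iou` are hypothetical, chosen for illustration.

```python
from typing import Tuple

# Hypothetical type alias: a segment is (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, ref: Segment) -> float:
    """Return the Intersection over Union of two time intervals, in [0, 1]."""
    (pred_start, pred_end), (ref_start, ref_end) = pred, ref
    if pred_end < pred_start or ref_end < ref_start:
        raise ValueError("each segment needs start <= end")

    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection

    # Two zero-length segments have no union; treat that case as no overlap.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted 5-15 s against reference 10-20 s: 5 s overlap over a 15 s union.
    print(temporal_iou((5.0, 15.0), (10.0, 20.0)))
```

A value of 1.0 means the predicted boundaries match the reference exactly, and 0.0 means the two segments do not overlap, which is why IoU is a natural basis for scoring temporal localizations.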




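Source journal

Information Sciences (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 14.00

Self-citation rate: 17.30%

Articles published: 1322

Review time: 10.4 months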

Journal description: Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and survey contributions. Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up to date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.