Context-Infused Trajectories: Enhancing Context and Frame Consistency in Reasoning Video Object Segmentation.

Impact factor (IF): 13.7

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2026-05-06. DOI: 10.1109/TIP.2026.3689427

Yunzhi Zhuge, Sitong Gong, Lu Zhang, Qi Xu, Wenda Zhao, Jin Zhan, Huchuan Lu


Citations: 0

Abstract

Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress a limited number of video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and loses substantial spatial detail, which limits the model's capability and its segmentation consistency. To overcome these limitations, we introduce the Context-infused Consistent Video Segmentor (CiCVS), a novel framework that leverages contextual information to guide the generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which first samples support frames globally across the entire video to ensure broad temporal coverage and then uniformly selects target frames from within the support set. It also employs a Contextual Token Prompting (CTP) module, which uses contextual cues from the support frames to guide the MLLM in generating specialized tokens for the individual target frames, enabling the model to capture intricate temporal patterns and remain consistent across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support-frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state of the art on multiple benchmarks, with J&F improvements of +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.
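The two-stage sampling that the abstract attributes to HFS can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the support frames are sampled "globally", so evenly spaced sampling is assumed here, and the function names (`uniform_indices`, `hierarchical_frame_sampling`) and the frame counts are hypothetical.

```python
def uniform_indices(n_items: int, n_samples: int) -> list[int]:
    """Return n_samples evenly spaced indices into range(n_items)."""
    if n_items <= 0 or n_samples <= 0:
        return []
    n_samples = min(n_samples, n_items)
    if n_samples == 1:
        return [n_items // 2]
    step = (n_items - 1) / (n_samples - 1)
    # Round half up; since step >= 1 the resulting indices are distinct.
    return [int(i * step + 0.5) for i in range(n_samples)]


def hierarchical_frame_sampling(
    num_frames: int, num_support: int, num_target: int
) -> tuple[list[int], list[int]]:
    """Two-stage sampling sketch (assumed evenly spaced at both stages).

    Stage 1: pick support frames spread over the whole video.
    Stage 2: pick target frames uniformly from the support set,
             so every target frame is also a support frame.
    """
    support = uniform_indices(num_frames, num_support)
    picks = uniform_indices(len(support), num_target)
    target = [support[i] for i in picks]
    return support, target


if __name__ == "__main__":
    support, target = hierarchical_frame_sampling(100, 10, 4)
    print("support frames:", support)
    print("target frames: ", target)
```

Under these assumptions, a 100-frame video with 10 support frames and 4 target frames yields support frames spaced 11 apart and target frames `[0, 33, 66, 99]`. The property the sketch is meant to show is that the target frames are always a subset of the support frames.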

