Yunzhi Zhuge, Sitong Gong, Lu Zhang, Qi Xu, Wenda Zhao, Jin Zhan, Huchuan Lu
{"title":"Context-Infused Trajectories: Enhancing Context and Frame Consistency in Reasoning Video Object Segmentation.","authors":"Yunzhi Zhuge, Sitong Gong, Lu Zhang, Qi Xu, Wenda Zhao, Jin Zhan, Huchuan Lu","doi":"10.1109/TIP.2026.3689427","DOIUrl":null,"url":null,"abstract":"<p><p>Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in J&F scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TIP.2026.3689427","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Reasoning video object segmentation (ReaVOS) aims to segment referred objects in video sequences based on implicit and complex linguistic queries. Existing methods typically compress limited video frames into pooled representations and prompt multimodal large language models (MLLMs) to generate a single global segmentation token. However, this strategy lacks explicit contextual guidance and causes substantial loss of spatial details, limiting capability and segmentation consistency. To overcome these limitations, we introduce Context-infused Consistent Video Segmentor (CiCVS), a novel framework leveraging contextual information to guide generation of temporally coherent and accurate mask trajectories. CiCVS incorporates a Hierarchical Frame Sampling (HFS) module, which globally samples support frames across the entire video to ensure broad temporal coverage, and then uniformly selects target frames within the support set. It also employs a Contextual Token Prompting (CTP) module, which utilizes contextual cues from support frames to guide the MLLM in generating specialized tokens for various target frames, enabling the model to capture intricate temporal patterns and ensure consistency across long-range sequences. At the core of CTP is the Multimodal Injection Compressor (MIC) block, which efficiently integrates support frame features and textual semantic information into a compact set of latent queries, enhancing temporal-level object perception. To further advance the ReaVOS field, we introduce the CoCoRVOS benchmark, which features more temporally intricate reasoning instructions and a diverse set of video scenarios. Extensive experiments demonstrate that CiCVS establishes a new state-of-the-art on multiple benchmarks, achieving significant improvements in J&F scores, including +2.7 on CoCoRVOS, +1.4 on ReVOS, and +7.0 on ReasonVOS, underscoring its superior contextual reasoning and segmentation capabilities.