Parameter-efficient weakly supervised referring video object segmentation via chain-of-thought reasoning

IF 4.6 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Complex & Intelligent Systems · Pub Date: 2025-05-08 · DOI: 10.1007/s40747-025-01900-1

Xing Wang, Zhe Xu, Yuanshi Zheng, Handing Wang


Citations: 0


Parameter-efficient weakly supervised referring video object segmentation via chain-of-thought reasoning

Referring video object segmentation (RVOS) aims to segment the object in a video that a language expression refers to. Most existing RVOS methods are trained on accurate per-pixel annotations, which are expensive and time-consuming to obtain. Moreover, they update all the parameters of a segmentation model, which makes training increasingly inefficient as model scale grows. In this paper, we propose a novel parameter-efficient framework under weak supervision, dubbed ReferringAdapter, to address both issues. Specifically, we adapt an off-the-shelf image segmentation model to RVOS by plugging a small set of trainable parameters, i.e., an adapter, into an intermediate layer. This efficiently endows a uni-modal image segmentation model with the cross-modal ability to segment the video object referred to by a language expression. To update the adapter parameters under weak supervision, instead of directly fusing the video and sentence-level language features, we propose chain-of-thought reasoning that considers the intermediate steps along the thought process. Extensive experiments demonstrate that training the adapter, which accounts for 1.1% of the total parameters, outperforms previous weakly supervised methods by 11.6 to 15.3 mAP and achieves performance comparable to fully supervised ones.
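The adapter idea in the abstract can be illustrated with a minimal sketch. This is not the authors' ReferringAdapter: the abstract does not specify the adapter's architecture, so the residual bottleneck design, the zero initialization, the names `BottleneckAdapter`, `zero_init_adapter`, and `trainable_fraction`, and all sizes below are assumptions chosen only to show how a small trainable module can sit beside frozen weights. Language conditioning and chain-of-thought reasoning are omitted.

```python
from dataclasses import dataclass


@dataclass
class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: x + up(relu(down(x)))."""
    down: list  # r rows, each with d weights (projects d -> r)
    up: list    # d rows, each with r weights (projects r -> d)

    def __call__(self, x):
        # Down-project to the bottleneck and apply ReLU.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.down]
        # Up-project back to the feature dimension.
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.up]
        # The residual connection keeps the frozen model's features intact.
        return [v + dv for v, dv in zip(x, delta)]

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def zero_init_adapter(d, r):
    """Build an adapter whose up-projection is zero, so it starts as the identity."""
    down = [[0.01] * d for _ in range(r)]
    up = [[0.0] * r for _ in range(d)]
    return BottleneckAdapter(down, up)


def trainable_fraction(frozen_params, adapter_params):
    """Share of all parameters that would be updated if only the adapter is trained."""
    return adapter_params / (frozen_params + adapter_params)


# Toy sizes, purely illustrative (not taken from the paper).
adapter = zero_init_adapter(d=8, r=2)
features = [1.0] * 8
adapted = adapter(features)                                    # unchanged at initialization
fraction = trainable_fraction(frozen_params=968, adapter_params=adapter.num_params())
```

With these toy sizes the adapter holds 32 weights against 968 frozen ones, so only 3.2% of the parameters would be trained; the 1.1% figure in the abstract refers to the authors' actual model, whose sizes are not given here.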


Source journal

Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 9.60

Self-citation rate: 10.30%

Publication volume: 297

About the journal: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.