UpGen: Unleashing Potential of Foundation Models for Training-Free Camouflage Detection via Generative Models

Ji Du; Jiesheng Wu; Desheng Kong; Weiyun Liang; Fangwei Hao; Jing Xu; Bin Wang; Guiling Wang; Ping Li

IEEE Transactions on Image Processing, vol. 34, pp. 5400-5413, published 2025-08-21. DOI: 10.1109/TIP.2025.3599101 (https://ieeexplore.ieee.org/document/11131534/). Journal impact factor: 13.7.

Abstract

Camouflaged Object Detection (COD) aims to segment objects that resemble their surroundings. To sidestep the extensive annotation and complex optimization that supervised learning requires, recent prompt-based segmentation methods elicit informative prompts from Large Vision-Language Models (LVLMs) and refine them with various foundation models; the refined prompts are then fed to the Segment Anything Model (SAM) for segmentation. However, because LVLMs hallucinate and the refinement stage allows only limited image-prompt interaction, these prompts often fail to classify and localize camouflaged objects accurately, degrading performance. To supply SAM with more informative prompts, we present UpGen, a training-free pipeline that prompts SAM with generative prompts, marking a novel integration of generative models with LVLMs. Specifically, we propose the Multi-Student-Single-Teacher (MSST) knowledge-integration framework, which combines insights from multiple sources to alleviate LVLM hallucinations and improve the classification of camouflaged objects. To strengthen interaction during prompt refinement, we are the first to apply generative models to real camouflage images to produce SAM-style prompts without fine-tuning. By exploiting the distinctive learning mechanism and structure of generative models, we enable effective image-prompt interaction and generate highly informative prompts for SAM. Extensive experiments show that UpGen outperforms weakly-supervised models and its SAM-based counterparts. We also integrate our framework into existing weakly-supervised methods to generate pseudo-labels, yielding consistent performance gains. Moreover, with minor adjustments, UpGen achieves promising results on open-vocabulary COD, referring COD, salient object detection, marine animal segmentation, and transparent object segmentation.
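To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of the training-free flow: several LVLM "students" propose a class name, a "teacher" arbitrates, a generative model's response map is reduced to a SAM-style point prompt, and SAM produces the mask. The abstract does not specify the MSST integration rule or the generative-model interface, so the students, teacher, and generator callables and the majority-vote rule are assumptions for illustration; only the segment-anything calls follow that library's actual API.

    from collections import Counter

    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry


    def msst_classify(image, students, teacher):
        # Multi-Student-Single-Teacher (MSST), approximated: each LVLM
        # "student" names the camouflaged object; a majority vote stands in
        # for the paper's integration rule, and the teacher arbitrates only
        # when no candidate repeats. The actual MSST mechanism may differ.
        votes = [student(image) for student in students]  # e.g. ["moth", "moth", "leaf"]
        top, count = Counter(votes).most_common(1)[0]
        return top if count > 1 else teacher(image, candidates=votes)


    def generative_point_prompt(image, class_name, generator):
        # Reduce a generative model's relevance map for class_name (a
        # hypothetical interface returning an H x W array) to a single
        # foreground point prompt in SAM's (x, y) pixel convention.
        heatmap = generator(image, class_name)
        y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
        return np.array([[x, y]]), np.array([1])


    def segment(image, students, teacher, generator, checkpoint="sam_vit_h.pth"):
        class_name = msst_classify(image, students, teacher)
        points, labels = generative_point_prompt(image, class_name, generator)
        sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
        predictor = SamPredictor(sam)
        predictor.set_image(image)  # RGB uint8 array, H x W x 3
        masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels)
        return masks[np.argmax(scores)]  # keep the highest-scoring mask

A single foreground point is just one plausible way to convert a generative response map into SAM's prompt format; box or mask prompts would slot into the same predictor.predict() call.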