UpGen: Unleashing Potential of Foundation Models for Training-Free Camouflage Detection via Generative Models

Ji Du; Jiesheng Wu; Desheng Kong; Weiyun Liang; Fangwei Hao; Jing Xu; Bin Wang; Guiling Wang; Ping Li

IEEE Transactions on Image Processing, vol. 34, pp. 5400-5413, 2025. DOI: 10.1109/TIP.2025.3599101
Camouflaged Object Detection (COD) aims to segment objects that closely resemble their surroundings. To avoid the extensive annotation and complex optimization required by supervised learning, recent prompt-based segmentation methods extract informative prompts from Large Vision-Language Models (LVLMs), refine them with various foundation models, and feed the results into the Segment Anything Model (SAM) for segmentation. However, because LVLMs hallucinate and image-prompt interaction during the refinement stage is limited, these prompts often fail to reliably discriminate the class and localize the position of camouflaged objects, degrading performance. To provide SAM with more informative prompts, we present UpGen, a training-free pipeline that prompts SAM with generative prompts, marking a novel integration of generative models with LVLMs. Specifically, we propose the Multi-Student-Single-Teacher (MSST) knowledge integration framework to alleviate LVLM hallucinations; it integrates insights from multiple sources to improve the classification of camouflaged objects. To strengthen interaction during prompt refinement, we are the first to apply generative models to real camouflage images to produce SAM-style prompts without fine-tuning. By exploiting the distinctive learning mechanism and structure of generative models, we enable effective image-prompt interaction and generate highly informative prompts for SAM. Extensive experiments demonstrate that UpGen outperforms weakly-supervised models and its SAM-based counterparts. We also plug our framework into existing weakly-supervised methods to generate pseudo-labels, yielding consistent performance gains. Moreover, with minor adjustments, UpGen shows promising results on open-vocabulary COD, referring COD, salient object detection, marine animal segmentation, and transparent object segmentation.
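To make the classify-then-prompt-then-segment flow concrete, the Python sketch below illustrates one plausible shape of such a training-free pipeline. It is not the authors' implementation: the student/teacher callables, the `generator` response map, and the top-k point reduction in `prompts_from_heatmap` are hypothetical placeholders standing in for the paper's unspecified components; the only concrete API assumed is `set_image`/`predict` from the public `segment_anything` `SamPredictor`.

```python
# Illustrative sketch of a training-free UpGen-style pipeline.
# All component names are placeholders, not the paper's actual modules.
from collections import Counter
from typing import Callable, List

import numpy as np


def msst_classify(image, students: List[Callable], teacher: Callable) -> str:
    """MSST-style knowledge integration (illustrative): several 'student'
    models each propose a class label for the camouflaged object, and a
    'teacher' LVLM arbitrates when they disagree, suppressing hallucinated
    answers from any single model."""
    candidates = [student(image) for student in students]
    # Majority vote as a simple stand-in for the paper's integration strategy.
    label, count = Counter(candidates).most_common(1)[0]
    if count > len(candidates) // 2:
        return label
    # No consensus: defer to the teacher, conditioned on all candidate labels.
    return teacher(image, candidates)


def prompts_from_heatmap(heatmap: np.ndarray, k: int = 3) -> np.ndarray:
    """Reduce a generative-model response map to SAM-style point prompts by
    taking the k highest-activation locations (one plausible reduction; the
    paper's actual prompt construction may differ)."""
    flat = np.argsort(heatmap, axis=None)[-k:]
    ys, xs = np.unravel_index(flat, heatmap.shape)
    return np.stack([xs, ys], axis=1)  # SAM expects (x, y) coordinates


def upgen_style_segment(image, students, teacher, generator, sam_predictor):
    """End-to-end training-free flow: classify, derive prompts, segment."""
    label = msst_classify(image, students, teacher)
    # `generator` stands for a generative model queried with the class label
    # on the real image, returning a spatial response map over the image.
    heatmap = generator(image, label)
    points = prompts_from_heatmap(heatmap)
    sam_predictor.set_image(image)
    masks, scores, _ = sam_predictor.predict(
        point_coords=points.astype(np.float32),
        point_labels=np.ones(len(points), dtype=np.int32),  # all foreground
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep SAM's highest-scoring mask
```

The majority vote and top-k point selection are deliberately the simplest possible instantiations of "integrating multiple sources" and "deriving SAM-style prompts from a response map"; they show where each abstract-level component slots into the pipeline, not how UpGen realizes it.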