{"title":"参考伪装目标检测的不确定性感知变压器","authors":"Ranwan Wu;Tian-Zhu Xiang;Guo-Sen Xie;Rongrong Gao;Xiangbo Shu;Fang Zhao;Ling Shao","doi":"10.1109/TIP.2025.3587579","DOIUrl":null,"url":null,"abstract":"Referring camouflaged object detection (Ref-COD) is a recently proposed task, aiming to segment specified camouflaged objects by leveraging visual reference, i.e., a small set of referring images with salient target objects. Ref-COD poses a considerable challenge due to the difficulty of discerning camouflaged objects from their highly similar backgrounds, as well as the significant feature differences between the camouflaged objects and the provided visual reference. To tackle the above dilemma, we propose a novel uncertainty-aware transformer for the Ref-COD task, termed UAT. UAT first utilizes a cross-attention mechanism to align and integrate visual reference to guide camouflaged feature learning, and then models dependencies between patches in a probabilistic manner to learn predictive uncertainty and excavate discriminative camouflaged features. Specifically, we first design a referring feature aggregation (RFA) module to align and incorporate referring features with camouflaged features, guiding targeted specific feature learning within the feature space of camouflaged images. Then, to enhance multi-level feature extraction, we develop a cross-attention encoder (CAE) to integrate global information and multi-scale semantics between adjacent layers to excavate critical camouflage cues. More importantly, we propose a transformer probabilistic decoder (TPD) to model the dependencies between patches as Gaussian random variables to capture uncertainty-aware camouflaged features. Extensive experiments on the golden Ref-COD benchmark demonstrate the superiority of UAT over existing state-of-the-art competitors. The proposed UAT also achieves competitive performance on several conventional COD datasets, further demonstrating its scalability. The source code is available at <uri>https://github.com/CVL-hub/UAT</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5341-5354"},"PeriodicalIF":13.7000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Uncertainty-Aware Transformer for Referring Camouflaged Object Detection\",\"authors\":\"Ranwan Wu;Tian-Zhu Xiang;Guo-Sen Xie;Rongrong Gao;Xiangbo Shu;Fang Zhao;Ling Shao\",\"doi\":\"10.1109/TIP.2025.3587579\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Referring camouflaged object detection (Ref-COD) is a recently proposed task, aiming to segment specified camouflaged objects by leveraging visual reference, i.e., a small set of referring images with salient target objects. Ref-COD poses a considerable challenge due to the difficulty of discerning camouflaged objects from their highly similar backgrounds, as well as the significant feature differences between the camouflaged objects and the provided visual reference. To tackle the above dilemma, we propose a novel uncertainty-aware transformer for the Ref-COD task, termed UAT. UAT first utilizes a cross-attention mechanism to align and integrate visual reference to guide camouflaged feature learning, and then models dependencies between patches in a probabilistic manner to learn predictive uncertainty and excavate discriminative camouflaged features. 
Specifically, we first design a referring feature aggregation (RFA) module to align and incorporate referring features with camouflaged features, guiding targeted specific feature learning within the feature space of camouflaged images. Then, to enhance multi-level feature extraction, we develop a cross-attention encoder (CAE) to integrate global information and multi-scale semantics between adjacent layers to excavate critical camouflage cues. More importantly, we propose a transformer probabilistic decoder (TPD) to model the dependencies between patches as Gaussian random variables to capture uncertainty-aware camouflaged features. Extensive experiments on the golden Ref-COD benchmark demonstrate the superiority of UAT over existing state-of-the-art competitors. The proposed UAT also achieves competitive performance on several conventional COD datasets, further demonstrating its scalability. The source code is available at <uri>https://github.com/CVL-hub/UAT</uri>\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"34 \",\"pages\":\"5341-5354\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-07-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11080234/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11080234/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Uncertainty-Aware Transformer for Referring Camouflaged Object Detection
Referring camouflaged object detection (Ref-COD) is a recently proposed task that aims to segment specified camouflaged objects by leveraging a visual reference, i.e., a small set of referring images containing salient target objects. Ref-COD is considerably challenging because camouflaged objects are hard to discern from their highly similar backgrounds, and because their features differ significantly from those of the provided visual reference. To tackle this dilemma, we propose a novel uncertainty-aware transformer for the Ref-COD task, termed UAT. UAT first uses a cross-attention mechanism to align and integrate the visual reference to guide camouflaged feature learning, and then models the dependencies between patches in a probabilistic manner to learn predictive uncertainty and mine discriminative camouflaged features. Specifically, we first design a referring feature aggregation (RFA) module that aligns and fuses referring features with camouflaged features, guiding target-specific feature learning within the feature space of camouflaged images. Then, to enhance multi-level feature extraction, we develop a cross-attention encoder (CAE) that integrates global information and multi-scale semantics between adjacent layers to excavate critical camouflage cues. More importantly, we propose a transformer probabilistic decoder (TPD) that models the dependencies between patches as Gaussian random variables to capture uncertainty-aware camouflaged features. Extensive experiments on the gold-standard Ref-COD benchmark demonstrate the superiority of UAT over existing state-of-the-art competitors. UAT also achieves competitive performance on several conventional COD datasets, further demonstrating its scalability. The source code is available at https://github.com/CVL-hub/UAT.
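To make the reference-guided fusion concrete, below is a minimal PyTorch sketch of a cross-attention step in the spirit of the RFA module described in the abstract: camouflaged-image tokens act as queries and pooled referring tokens as keys/values. All class, argument, and dimension choices here are illustrative assumptions, not the authors' implementation (see the released code for that).

```python
# Hedged sketch of reference-guided cross-attention (RFA-style fusion).
import torch
import torch.nn as nn

class ReferringFeatureAggregation(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Camouflaged features query the referring features, so the salient
        # reference guides which camouflaged regions to emphasize.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # cam_tokens: (B, N, C) patch tokens of the camouflaged image
        # ref_tokens: (B, M, C) tokens pooled from the referring images
        attended, _ = self.cross_attn(cam_tokens, ref_tokens, ref_tokens)
        # Residual fusion keeps the original camouflaged features intact.
        return self.norm(cam_tokens + attended)

# Toy usage: 196 camouflaged patch tokens guided by 16 reference tokens.
rfa = ReferringFeatureAggregation(dim=256)
out = rfa(torch.randn(2, 196, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 196, 256])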
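The CAE is described as integrating global information and multi-scale semantics between adjacent encoder levels. The sketch below shows one plausible reading of that idea, with shallower (finer) tokens querying the adjacent deeper (more semantic) level; the fusion direction and feed-forward design are assumptions for illustration only.

```python
# Hedged sketch of cross-attention between adjacent feature levels (CAE-style).
import torch
import torch.nn as nn

class AdjacentLevelCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, N, C) tokens from the shallower, higher-resolution level
        # coarse: (B, M, C) tokens from the adjacent deeper, more semantic level
        fused, _ = self.attn(fine, coarse, coarse)  # fine queries coarse semantics
        x = self.norm(fine + fused)
        return x + self.ffn(x)                      # token-wise refinement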
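Finally, the TPD models patch dependencies as Gaussian random variables to capture uncertainty. A common way to realize such a probabilistic head is to predict a per-token mean and variance and sample via reparameterization, with sample disagreement serving as a predictive-uncertainty estimate; the sketch below follows that standard recipe under stated assumptions and is not the paper's exact decoder.

```python
# Hedged sketch of Gaussian patch tokens with reparameterized sampling.
import torch
import torch.nn as nn

class ProbabilisticTokenHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)       # per-token Gaussian mean
        self.to_logvar = nn.Linear(dim, dim)   # per-token log-variance

    def forward(self, tokens: torch.Tensor, num_samples: int = 4):
        # tokens: (B, N, C) decoded patch tokens
        mu, logvar = self.to_mu(tokens), self.to_logvar(tokens)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: z = mu + std * eps keeps gradients flowing.
        samples = torch.stack(
            [mu + std * torch.randn_like(std) for _ in range(num_samples)]
        )                                       # (S, B, N, C)
        uncertainty = samples.var(dim=0)        # per-token predictive variance
        return samples.mean(dim=0), uncertainty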