{"title":"Deep Residual Coupled Prompt Learning for Zero-Shot Sketch-Based Image Retrieval","authors":"Guangyao Zhuo;Zhenqiu Shu;Zhengtao Yu","doi":"10.1109/TBDATA.2024.3481898","DOIUrl":null,"url":null,"abstract":"Zero-shot sketch-based image retrieval (ZS-SBIR) aims to utilize freehand sketches for retrieving natural images with similar semantics in realistic zero-shot scenarios. Existing works focus on zero-shot semantic transfer using category word embedding and leveraging teacher-student networks to alleviate catastrophic forgetting of pre-trained models. They aim to retain rich discriminative features to achieve zero-shot semantic transfer. However, the category word embedding method is insufficient in flexibility, thereby limiting their retrieval performances in ZS-SBIR scenarios. In addition, the teacher network used for generating guidance signals results in computational redundancy, requiring repeated processing of mini-batch inputs. To address these issues, we propose a deep residual coupled prompt learning (DRCPL) for ZS-SBIR. Specifically, we leverage the text encoder of CLIP to generate category classification weights, thereby improving the flexibility and generality of zero-shot semantic transfer. To tune text and vision representations effectively, we introduce learnable prompts at the input and freeze the parameters of the CLIP encoder. This approach not only effectively prevents catastrophic forgetting, but also significantly reduces the computational complexity of the model. We also introduce the text-vision prompt coupling function to enhance the coordinated consistency between the text and vision representations, ensuring that the two branches can train collaboratively. Finally, we gradually establish stage feature relationships by learning prompts independently at different early stages to facilitate rich contextual learning. Comprehensive experimental results demonstrate that our DRCPL method achieves state-of-the-art performance in ZS-SBIR tasks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1493-1507"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Big Data","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10720065/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) aims to use freehand sketches to retrieve natural images with similar semantics in realistic zero-shot scenarios. Existing works focus on zero-shot semantic transfer via category word embeddings and leverage teacher-student networks to alleviate catastrophic forgetting in pre-trained models, aiming to retain rich discriminative features for zero-shot semantic transfer. However, the category word embedding approach lacks flexibility, which limits retrieval performance in ZS-SBIR scenarios. In addition, the teacher network used to generate guidance signals introduces computational redundancy, since each mini-batch input must be processed repeatedly. To address these issues, we propose a deep residual coupled prompt learning (DRCPL) method for ZS-SBIR. Specifically, we leverage the text encoder of CLIP to generate category classification weights, improving the flexibility and generality of zero-shot semantic transfer. To tune the text and vision representations effectively, we introduce learnable prompts at the input and freeze the parameters of the CLIP encoder. This approach not only effectively prevents catastrophic forgetting but also significantly reduces the computational complexity of the model. We also introduce a text-vision prompt coupling function to enhance the consistency between the text and vision representations, ensuring that the two branches train collaboratively. Finally, we gradually establish relationships between stage features by learning prompts independently at different early stages, facilitating rich contextual learning. Comprehensive experimental results demonstrate that our DRCPL method achieves state-of-the-art performance on ZS-SBIR tasks.
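To make the mechanism in the abstract concrete, the following is a minimal PyTorch sketch of prompt learning over a frozen two-branch CLIP-style backbone with a text-to-vision coupling function. Everything here is an illustrative assumption rather than the authors' implementation: `FrozenEncoder` is a stub for a frozen CLIP branch, and the class names, dimensions, `prompt_depth`, and the linear coupling layers are hypothetical placeholders (a real CLIP model would additionally tokenize text and patchify images, and project both branches into its joint embedding space).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Stand-in for one frozen CLIP transformer branch (text or vision).

    Learnable prompt tokens are prepended to the input of selected early
    stages; the backbone weights themselves never receive gradients.
    """

    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )
        for p in self.parameters():  # freeze the entire backbone
            p.requires_grad_(False)

    def forward(self, x, stage_prompts):
        for block, prompts in zip(self.blocks, stage_prompts):
            if prompts is not None:
                n = prompts.size(0)
                # Prepend this stage's prompts to every sequence in the batch.
                x = torch.cat(
                    [prompts.unsqueeze(0).expand(x.size(0), -1, -1), x], dim=1
                )
                x = block(x)[:, n:]  # drop prompt positions before the next stage
            else:
                x = block(x)
        return x.mean(dim=1)  # pooled embedding


class CoupledPromptLearner(nn.Module):
    """Learnable text prompts plus per-stage coupling functions that derive
    the vision prompts from them, keeping the two branches consistent."""

    def __init__(self, dim=512, n_prompts=4, depth=12, prompt_depth=9):
        super().__init__()
        self.depth = depth
        self.text_enc = FrozenEncoder(dim, depth)
        self.vision_enc = FrozenEncoder(dim, depth)
        # Independent prompts for each of the early stages (deep prompting).
        self.text_prompts = nn.ParameterList(
            nn.Parameter(0.02 * torch.randn(n_prompts, dim))
            for _ in range(prompt_depth)
        )
        # Text-to-vision coupling function, one projection per prompted stage.
        self.couplers = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(prompt_depth)
        )

    def forward(self, image_tokens, text_tokens, logit_scale=100.0):
        txt_stage = [
            self.text_prompts[i] if i < len(self.text_prompts) else None
            for i in range(self.depth)
        ]
        vis_stage = [
            self.couplers[i](self.text_prompts[i]) if i < len(self.couplers) else None
            for i in range(self.depth)
        ]
        # Classification weights come from the text encoder, so unseen
        # categories only need new text tokens -- the zero-shot transfer step.
        txt = F.normalize(self.text_enc(text_tokens, txt_stage), dim=-1)
        img = F.normalize(self.vision_enc(image_tokens, vis_stage), dim=-1)
        return logit_scale * img @ txt.t()  # [batch, n_classes] similarities
```

A usage sketch under the same assumptions: only the prompts and coupling layers carry `requires_grad=True`, so an optimizer over the trainable parameters tunes a small fraction of the weights, with no teacher network and no repeated mini-batch passes.

```python
model = CoupledPromptLearner()
images = torch.randn(8, 50, 512)   # 8 sketches/photos as patch-token sequences
texts = torch.randn(10, 16, 512)   # 10 category names as token sequences
logits = model(images, texts)      # torch.Size([8, 10])
optim = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```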
About the Journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Covered research areas include big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields that generate massive datasets.