Deep Residual Coupled Prompt Learning for Zero-Shot Sketch-Based Image Retrieval

IF 7.5 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data Pub Date : 2024-10-16 DOI:10.1109/TBDATA.2024.3481898

Guangyao Zhuo;Zhenqiu Shu;Zhengtao Yu

IEEE Transactions on Big Data, vol. 11, no. 3, pp. 1493-1507. Full text: https://ieeexplore.ieee.org/document/10720065/

Citations: 0

Abstract

Zero-shot sketch-based image retrieval (ZS-SBIR) uses freehand sketches to retrieve natural images with similar semantics in realistic zero-shot scenarios. Existing works achieve zero-shot semantic transfer with category word embeddings and use teacher-student networks to alleviate catastrophic forgetting in pre-trained models, aiming to retain rich discriminative features. However, category word embeddings lack flexibility, which limits retrieval performance in ZS-SBIR scenarios. In addition, the teacher network that generates guidance signals introduces computational redundancy, since each mini-batch must be processed repeatedly. To address these issues, we propose deep residual coupled prompt learning (DRCPL) for ZS-SBIR. Specifically, we leverage the text encoder of CLIP to generate category classification weights, thereby improving the flexibility and generality of zero-shot semantic transfer. To tune text and vision representations effectively, we introduce learnable prompts at the input and freeze the parameters of the CLIP encoder. This approach not only prevents catastrophic forgetting but also significantly reduces the computational complexity of the model. We also introduce a text-vision prompt coupling function to enhance the consistency between the text and vision representations, ensuring that the two branches train collaboratively. Finally, we gradually establish stage feature relationships by learning prompts independently at different early stages to facilitate rich contextual learning. Comprehensive experimental results demonstrate that our DRCPL method achieves state-of-the-art performance in ZS-SBIR tasks.
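The abstract gives no equations or architecture details, so the following is only a toy sketch of three ideas it names, not the authors' implementation. It assumes that (a) the coupling function is a plain linear map from text prompts to vision prompts, (b) classification uses scaled cosine similarity between a feature and text-derived class weights, as is common with CLIP, and (c) retrieval ranks gallery images by cosine similarity to the sketch feature. All names (`couple`, `class_logits`, `retrieve`, `W`) and all vectors are hypothetical stand-ins; no real encoder is called.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def couple(text_prompts, W):
    """ASSUMPTION: derive each vision prompt from a text prompt via a linear map W.

    The abstract only says a 'coupling function' links the two branches.
    """
    return [[sum(w * x for w, x in zip(row, p)) for row in W] for p in text_prompts]


def class_logits(feature, class_weights, scale=100.0):
    """Scaled cosine similarity against class weights.

    In the described method the weights would come from the frozen CLIP text
    encoder; here they are hand-written toy vectors.
    """
    return [scale * cosine(feature, w) for w in class_weights]


def retrieve(sketch_feat, gallery_feats, k=2):
    """Return indices of the k gallery items most similar to the sketch."""
    order = sorted(
        range(len(gallery_feats)),
        key=lambda i: cosine(sketch_feat, gallery_feats[i]),
        reverse=True,
    )
    return order[:k]


# Toy data: two 2-d text prompts mapped to 3-d vision prompts.
text_prompts = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3x2 coupling matrix
vision_prompts = couple(text_prompts, W)

# Stand-ins for text-encoder outputs of two classes.
class_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
logits = class_logits([0.9, 0.1, 0.0], class_weights)

# Stand-ins for a sketch embedding and an image gallery.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.2, 0.0]]
ranking = retrieve([1.0, 0.1, 0.0], gallery, k=2)
```

In a real prompt-learning setup only the prompts and the coupling parameters would receive gradients while the encoder weights stay frozen; this sketch omits training entirely and shows only the forward computations.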

Source journal

IEEE Transactions on Big Data Multiple-

CiteScore: 11.80

Self-citation rate: 2.80%

Articles published: 114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.