DCDL: Dual Causal Disentangled Learning for Zero-Shot Sketch-Based Image Retrieval

Authors: Qiang Li, Shihao Wang, Wei Zhang, Shaojin Bai, Weizhi Nie, Anan Liu
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 5575-5590
Publication date: 2025-02-18
DOI: 10.1109/TMM.2025.3543035
URL: https://ieeexplore.ieee.org/document/10891621/
Citations: 0
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) is a challenging task that hinges on overcoming the cross-domain differences between sketches and images. Previous methods primarily address these differences by constructing a common embedding space, which improves final retrieval results. However, most prior approaches overlook a critical point: the sketch-based image retrieval task requires only the cross-domain invariant information relevant to retrieval. Irrelevant information (such as posture, expression, background, and specificity) may detract from retrieval accuracy. In addition, most previous methods perform well on traditional SBIR datasets, but their generalization and extensibility to more diverse and complex data remain understudied. To address these issues, we propose Dual Causal Disentangled Learning (DCDL) for ZS-SBIR. This approach mitigates the negative impact of irrelevant features by separating retrieval-relevant features in the latent variable space. Specifically, we construct a causal disentanglement model using two Variational Autoencoders (VAEs), one for the sketch domain and one for the image domain, to obtain disentangled variables with exchangeable attributes. Our framework integrates causal intervention with disentangled representation learning, enabling a clearer separation of cross-domain retrieval-relevant features from intra-class irrelevant features, which can be recombined into new reconstructed samples. Concurrently, we design a Dual Alignment Module (DAM) that leverages the accurate and comprehensive semantic features of a text encoder pre-trained on large-scale datasets to supplement semantic associations and align the disentangled retrieval-relevant features. By aligning retrieval-relevant information from different domains, the DAM enhances the model's ability to generalize across diverse datasets.

Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on the Sketchy and TU-Berlin datasets. Additional experiments on the larger-scale QuickDraw dataset, the fine-grained Shoe-V2 and Chair-V2 datasets, and an inter-dataset setting further validate the generalization and extensibility of DCDL.
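The disentangle-and-exchange step described in the abstract, where each domain's latent code is split into a retrieval-relevant part and an irrelevant part that can be swapped across domains, can be sketched as follows. All dimensions, the linear stand-in encoders, and the function names are illustrative assumptions, not the paper's actual VAE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each domain encoder maps an input feature to a
# retrieval-relevant code z_r and an irrelevant code z_i.
D_IN, D_REL, D_IRR = 16, 4, 4

def make_encoder(d_in, d_out, rng):
    """A linear stand-in for a VAE encoder's mean head (illustrative only)."""
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
    return lambda x: W @ x

# One encoder pair per domain, mirroring the two-VAE design.
sketch_enc_rel = make_encoder(D_IN, D_REL, rng)
sketch_enc_irr = make_encoder(D_IN, D_IRR, rng)
image_enc_rel = make_encoder(D_IN, D_REL, rng)
image_enc_irr = make_encoder(D_IN, D_IRR, rng)

def disentangle(x, enc_rel, enc_irr):
    """Split an input into (retrieval-relevant, irrelevant) latent codes."""
    return enc_rel(x), enc_irr(x)

def recombine(z_rel, z_irr):
    """Concatenate a retrieval-relevant code with an irrelevant code to
    form a new latent that a decoder could reconstruct into a sample."""
    return np.concatenate([z_rel, z_irr])

sketch_x = rng.normal(size=D_IN)
image_x = rng.normal(size=D_IN)

s_rel, s_irr = disentangle(sketch_x, sketch_enc_rel, sketch_enc_irr)
i_rel, i_irr = disentangle(image_x, image_enc_rel, image_enc_irr)

# Attribute exchange: pair the sketch's retrieval-relevant code with the
# image's irrelevant code, and vice versa, before reconstruction.
swapped_a = recombine(s_rel, i_irr)
swapped_b = recombine(i_rel, s_irr)
print(swapped_a.shape, swapped_b.shape)  # (8,) (8,)
```

In the paper's framework these recombined latents would be decoded into new reconstructed samples, giving the supervision signal that forces the two halves of the latent to disentangle.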
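The Dual Alignment Module's goal of pulling disentangled retrieval-relevant features toward the semantic embeddings of a pre-trained text encoder can be illustrated with a simple cosine-alignment loss. The loss form, function names, and toy vectors below are assumptions for illustration, not the paper's actual objective:

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def alignment_loss(visual_feats, text_feats):
    """Mean (1 - cosine similarity) over paired visual and text features:
    a stand-in for aligning retrieval-relevant codes with semantic embeddings."""
    losses = [
        1.0 - float(l2_normalize(v) @ l2_normalize(t))
        for v, t in zip(visual_feats, text_feats)
    ]
    return sum(losses) / len(losses)

# Perfectly aligned pairs (same direction, any scale) give zero loss;
# orthogonal pairs give a loss of one.
t = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 1.0, 0.0])
print(alignment_loss([t, 2.0 * t], [t, t]))  # 0.0
print(alignment_loss([u], [t]))              # 1.0
```

Because the loss depends only on direction, not magnitude, it pushes visual features from both domains toward a shared semantic direction per class, which is one plausible reading of how the DAM "aligns retrieval-relevant information from different domains".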
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.