DCDL: Dual Causal Disentangled Learning for Zero-Shot Sketch-Based Image Retrieval

Authors: Qiang Li, Shihao Wang, Wei Zhang, Shaojin Bai, Weizhi Nie, Anan Liu
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 5575-5590
Publication date: 2025-02-18
DOI: 10.1109/TMM.2025.3543035
URL: https://ieeexplore.ieee.org/document/10891621/
Citations: 0
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) is a challenging task that hinges on overcoming the cross-domain differences between sketches and images. Previous methods primarily address these differences by constructing a common embedding space, which improves final retrieval results. However, most prior approaches overlook a critical point: the sketch-based image retrieval task requires only the cross-domain invariant information relevant to retrieval. Irrelevant information (such as posture, expression, background, and specificity) may detract from retrieval accuracy. In addition, most previous methods perform well on traditional SBIR datasets, but their generalization and extensibility to more diverse and complex data remain understudied. To address these issues, we propose Dual Causal Disentangled Learning (DCDL) for ZS-SBIR. This approach mitigates the negative impact of irrelevant features by separating retrieval-relevant features in the latent variable space. Specifically, we construct a causal disentanglement model using two Variational Autoencoders (VAEs), one for the sketch domain and one for the image domain, to obtain disentangled variables with exchangeable attributes. Our framework integrates causal intervention with disentangled representation learning, enabling a clearer separation of cross-domain retrieval-relevant features from intra-class irrelevant features, which can be recombined into new reconstructed samples. Concurrently, we design a Dual Alignment Module (DAM) that leverages the accurate and comprehensive semantic features of a text encoder pre-trained on large-scale datasets to supplement semantic associations and align the disentangled retrieval-relevant features. By aligning retrieval-relevant information from different domains, the DAM enhances the model's ability to generalize across diverse datasets.

Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on the Sketchy and TU-Berlin datasets. Additional experiments on the larger-scale QuickDraw dataset, the fine-grained Shoe-V2 and Chair-V2 datasets, and an inter-dataset setting further validate the generalization and extensibility of DCDL.
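The disentangle-and-exchange step described in the abstract, where each domain's latent code is split into a retrieval-relevant part and an irrelevant part that can be swapped across domains, can be sketched as follows. All dimensions, the linear stand-in encoders, and the function names are illustrative assumptions, not the paper's actual VAE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: each domain encoder maps an input feature to a
# retrieval-relevant code z_r and an irrelevant code z_i.
D_IN, D_REL, D_IRR = 16, 4, 4

def make_encoder(d_in, d_out, rng):
    """A linear stand-in for a VAE encoder's mean head (illustrative only)."""
    W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
    return lambda x: W @ x

# One encoder pair per domain, mirroring the two-VAE design.
sketch_enc_rel = make_encoder(D_IN, D_REL, rng)
sketch_enc_irr = make_encoder(D_IN, D_IRR, rng)
image_enc_rel = make_encoder(D_IN, D_REL, rng)
image_enc_irr = make_encoder(D_IN, D_IRR, rng)

def disentangle(x, enc_rel, enc_irr):
    """Split an input into (retrieval-relevant, irrelevant) latent codes."""
    return enc_rel(x), enc_irr(x)

def recombine(z_rel, z_irr):
    """Concatenate a retrieval-relevant code with an irrelevant code to
    form a new latent that a decoder could reconstruct into a sample."""
    return np.concatenate([z_rel, z_irr])

sketch_x = rng.normal(size=D_IN)
image_x = rng.normal(size=D_IN)

s_rel, s_irr = disentangle(sketch_x, sketch_enc_rel, sketch_enc_irr)
i_rel, i_irr = disentangle(image_x, image_enc_rel, image_enc_irr)

# Attribute exchange: pair the sketch's retrieval-relevant code with the
# image's irrelevant code, and vice versa, before reconstruction.
swapped_a = recombine(s_rel, i_irr)
swapped_b = recombine(i_rel, s_irr)
print(swapped_a.shape, swapped_b.shape)  # (8,) (8,)
```

In the paper's framework these recombined latents would be decoded into new reconstructed samples, giving the supervision signal that forces the two halves of the latent to disentangle.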
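The Dual Alignment Module's goal of pulling disentangled retrieval-relevant features toward the semantic embeddings of a pre-trained text encoder can be illustrated with a simple cosine-alignment loss. The loss form, function names, and toy vectors below are assumptions for illustration, not the paper's actual objective:

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def alignment_loss(visual_feats, text_feats):
    """Mean (1 - cosine similarity) over paired visual and text features:
    a stand-in for aligning retrieval-relevant codes with semantic embeddings."""
    losses = [
        1.0 - float(l2_normalize(v) @ l2_normalize(t))
        for v, t in zip(visual_feats, text_feats)
    ]
    return sum(losses) / len(losses)

# Perfectly aligned pairs (same direction, any scale) give zero loss;
# orthogonal pairs give a loss of one.
t = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 1.0, 0.0])
print(alignment_loss([t, 2.0 * t], [t, t]))  # 0.0
print(alignment_loss([u], [t]))              # 1.0
```

Because the loss depends only on direction, not magnitude, it pushes visual features from both domains toward a shared semantic direction per class, which is one plausible reading of how the DAM "aligns retrieval-relevant information from different domains".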
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.