Yi Shi;Rui-Xiang Li;Le Gan;De-Chuan Zhan;Han-Jia Ye
Generalized Conditional Similarity Learning via Semantic Matching
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3847-3862
DOI: 10.1109/TPAMI.2025.3535730
Published: 2025-02-13
https://ieeexplore.ieee.org/document/10887026/
Citations: 0
Abstract
The inherent complexity of image semantics engenders a fascinating variability in relationships between images. For instance, under a certain condition, two images may demonstrate similarity, while under different circumstances, the same pair could exhibit absolute dissimilarity. A single feature space is therefore insufficient for capturing the nuanced semantic relationships that exist between samples. Conditional Similarity Learning (CSL) aims to address this gap by learning multiple distinct feature spaces. Existing approaches in CSL often fail to capture the intricate similarity relationships between samples across different semantic conditions, particularly in weakly-supervised settings where condition labels are absent during training. To address this limitation, we introduce the Distance Induced Semantic COndition VERification NETwork (DiscoverNet), a unified framework designed to cater to a range of CSL scenarios: supervised CSL (sCSL), weakly-supervised CSL (wsCSL), and semi-supervised CSL (ssCSL). In addition to traditional linear projections, we also introduce a prompt learning technique utilizing a transformer encoding layer to create diverse embedding spaces. Our framework incorporates a Condition Match Module (CMM) that dynamically matches different training triplets with corresponding embedding spaces, adapting to varying levels of supervision. We also shed light on existing evaluation biases in wsCSL and introduce two novel criteria for a more robust evaluation. Through extensive experiments and visualizations on benchmark datasets such as UT-Zappos-50k and Celeb-A, we substantiate the efficacy and interpretability of DiscoverNet.
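The core idea the abstract describes, condition-specific embedding spaces plus a module that matches each unlabeled triplet to the condition it best fits, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the mask-based subspaces, the argmin-margin condition matching, and all dimensions and hyperparameters below are simplifying assumptions standing in for DiscoverNet's learned projections and its Condition Match Module.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 8, 3  # embedding dim and number of semantic conditions (assumed sizes)

# Shared backbone embeddings for one training triplet (anchor, positive, negative).
anchor, positive, negative = rng.normal(size=(3, D))

# One mask per condition carves a condition-specific subspace out of the shared
# embedding (a common CSL construction; DiscoverNet also supports prompt-based
# transformer projections instead of these linear masks).
masks = rng.uniform(size=(K, D))

def cond_dist(x, y, k):
    """Euclidean distance measured in the subspace selected by condition k."""
    diff = masks[k] * (x - y)
    return float(np.sqrt(np.sum(diff ** 2)))

# Weakly-supervised condition matching: with no condition label available,
# assign the triplet to the condition under which it is best satisfied,
# i.e. the smallest (anchor-positive minus anchor-negative) margin.
margins = [cond_dist(anchor, positive, k) - cond_dist(anchor, negative, k)
           for k in range(K)]
best_k = int(np.argmin(margins))

# Standard triplet loss, evaluated only in the matched condition's subspace.
alpha = 0.2  # margin hyperparameter (assumed)
loss = max(0.0, margins[best_k] + alpha)
print(best_k, round(loss, 4))
```

In a real training loop the masks (or prompt parameters) would be learned jointly with the backbone, and the hard argmin assignment would typically be softened so gradients can flow to all condition spaces.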