{"title":"CSGN:CLIP-driven semantic guidance network for Clothes-Changing Person Re-Identification","authors":"Yang Lu , Bin Ge , Chenxing Xia , Junming Guan","doi":"10.1016/j.cviu.2025.104406","DOIUrl":null,"url":null,"abstract":"<div><div>Clothes-Changing Person Re-identification (CCReID) aims to match identities across images of individuals in different attires. Due to the significant appearance variations caused by clothing changes, distinguishing the same identity becomes challenging, while the differences between distinct individuals are often subtle. To address this, we reduce the impact of clothing information on identity judgment by introducing linguistic modalities. Considering CLIP’s (Contrastive Language-Image Pre-training) ability to align high-level semantic information with visual features, we propose a CLIP-driven Semantic Guidance Network (CSGN), which consists of a Multi-Description Generator (MDG), a Visual Semantic Steering module (VSS), and a Heterogeneous Semantic Fusion loss (HSF). Specifically, to mitigate the color sensitivity of CLIP’s text encoder, we design the MDG to generate pseudo-text in both RGB and grayscale modalities, incorporating a combined loss function for text-image mutuality. This helps reduce the encoder’s bias towards color. Additionally, to improve the CLIP visual encoder’s ability to extract identity-independent features, we construct the VSS, which combines ResNet and ViT feature extractors to enhance visual feature extraction. Finally, recognizing the complementary nature of semantics in heterogeneous descriptions, we use HSF, which constrains visual features by focusing not only on pseudo-text derived from RGB but also on pseudo-text derived from grayscale, thereby mitigating the influence of clothing information. Experimental results show that our method outperforms existing state-of-the-art approaches.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104406"},"PeriodicalIF":4.3000,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001298","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Clothes-Changing Person Re-identification (CCReID) aims to match identities across images of individuals wearing different attire. Because clothing changes cause significant appearance variations, images of the same identity become hard to match, while the differences between distinct individuals are often subtle. To address this, we introduce a linguistic modality to reduce the impact of clothing information on identity judgment. Leveraging the ability of CLIP (Contrastive Language-Image Pre-training) to align high-level semantic information with visual features, we propose a CLIP-driven Semantic Guidance Network (CSGN), which consists of a Multi-Description Generator (MDG), a Visual Semantic Steering module (VSS), and a Heterogeneous Semantic Fusion loss (HSF). Specifically, to mitigate the color sensitivity of CLIP's text encoder, the MDG generates pseudo-text in both RGB and grayscale modalities, with a combined loss that enforces mutual text-image alignment; this reduces the encoder's bias toward color. Additionally, to improve the CLIP visual encoder's ability to extract identity-independent features, the VSS combines ResNet and ViT feature extractors to enhance visual feature extraction. Finally, recognizing that the semantics of heterogeneous descriptions are complementary, the HSF constrains visual features against pseudo-text derived not only from RGB images but also from grayscale images, further mitigating the influence of clothing information. Experimental results show that our method outperforms existing state-of-the-art approaches.
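To make the dual-text constraint concrete, below is a minimal PyTorch-style sketch of aligning visual features with pseudo-text from both RGB and grayscale modalities, in the spirit of the HSF loss described in the abstract. The symmetric InfoNCE-style formulation, the function names, and the `lambda_gray` weighting are all illustrative assumptions; the paper's exact loss is not specified here.

```python
# Illustrative sketch only: the actual MDG/HSF formulations in the paper may differ.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss between L2-normalized image and text features.

    Assumes matched pairs sit on the diagonal of the similarity matrix,
    as in standard CLIP-style training.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def heterogeneous_fusion_loss(visual_feats, rgb_text_feats, gray_text_feats,
                              lambda_gray=1.0):
    """Constrain visual features with pseudo-text from BOTH modalities.

    Aligning against grayscale-derived pseudo-text (weighted by the
    hypothetical lambda_gray) is meant to down-weight clothing color cues.
    """
    loss_rgb = clip_contrastive_loss(visual_feats, rgb_text_feats)
    loss_gray = clip_contrastive_loss(visual_feats, gray_text_feats)
    return loss_rgb + lambda_gray * loss_gray
```

The key design point this sketch illustrates is that a single set of visual features receives gradients from two textual anchors, so representations that satisfy both (identity-consistent, color-insensitive) are favored.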
About the journal
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems