CLIP-Based Multi-Modal Feature Learning for Cloth-Changing Person Re-Identification
Guoqing Zhang; Jieqiong Zhou; Lu Jiang; Yuhui Zheng; Weisi Lin
IEEE Transactions on Image Processing, vol. 34, pp. 5570-5583, 2025. DOI: 10.1109/TIP.2025.3602641
Abstract
Contrastive Language-Image Pre-training (CLIP) has achieved remarkable results in person re-identification (ReID) thanks to its strong cross-modal understanding and high scalability. However, the text encoder of CLIP mainly attends to easy-to-describe attributes such as clothing, and clothing is precisely the main interference factor that degrades recognition accuracy in cloth-changing person ReID (CC ReID). Consequently, directly applying CLIP to cloth-changing scenarios makes it difficult to adapt to such dynamic appearance changes and thus harms identification accuracy. To address this challenge, we propose a CLIP-based multi-modal feature learning framework (CMFF) for CC ReID. Specifically, we first design a pose-aware identity enhancement module (PIE) to strengthen the model's perception of identity-intrinsic information. In this branch, to weaken the interference of clothing information, we apply a ranking loss that minimizes the gap between appearance and pose features in the shared feature space. Secondly, we propose a global-local hybrid attention module (GLHA), which fuses head and global features through a cross-attention mechanism, enhancing the contribution of key head information to global recognition. Finally, considering that existing CLIP-based methods often overlook the potential importance of shallow features, we propose a graph-based multi-layer interactive enhancement module (GMIE), which groups and integrates multi-layer features from the image encoder to enhance the contextual awareness of multi-scale features. Extensive experiments on multiple popular person ReID datasets validate the outstanding performance of the proposed CMFF.
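To make the cross-attention fusion described for GLHA more concrete, below is a minimal PyTorch-style sketch of head-to-global feature fusion. It is an illustration under stated assumptions, not the authors' implementation: the tensor shapes, the assignment of head tokens as queries and global tokens as keys/values, and the residual-plus-normalization design are all assumptions made here for clarity.

```python
# Minimal sketch (not the paper's released code) of cross-attention fusion
# between head-region features and global image features, as GLHA describes.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, head_tokens, global_tokens):
        # head_tokens:   (B, N_h, dim) features from the head region
        # global_tokens: (B, N_g, dim) features from the whole image
        # Assumption: head tokens act as queries; global tokens supply keys/values.
        fused, _ = self.attn(query=head_tokens,
                             key=global_tokens,
                             value=global_tokens)
        return self.norm(head_tokens + fused)  # residual connection + layer norm

# Usage with dummy features:
fusion = CrossAttentionFusion()
head = torch.randn(2, 4, 512)     # e.g. 4 head-region tokens per image
glob = torch.randn(2, 196, 512)   # e.g. 14x14 patch tokens from the image encoder
out = fusion(head, glob)          # (2, 4, 512) head tokens enriched with global context
```

In this reading, the attention weights let each head token select identity-relevant context from the global patch tokens, which matches the abstract's goal of strengthening the role of key head information in the global representation; the paper itself may fuse in the opposite direction or add further projection layers.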