CLIP-Based Camera-Agnostic Feature Learning for Intra-Camera Supervised Person Re-Identification

IF 8.3 | Zone 1, Engineering & Technology | JCR Q1, Engineering, Electrical & Electronic

IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2024-12-24 | DOI: 10.1109/TCSVT.2024.3522178

Xuan Tan; Xun Gong; Yang Xiang

Volume 35, Issue 5, pp. 4100-4115 | Journal Article | IEEE Xplore: https://ieeexplore.ieee.org/document/10813454/

Citations: 0

Abstract

CLIP-Based Camera-Agnostic Feature Learning for Intra-Camera Supervised Person Re-Identification

The Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks because of its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) is challenging: ICS ReID labels identities independently within each camera, with no associations across cameras, which limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Two custom modules guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. Then, we design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning finer-grained intra-camera pedestrian features. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model’s ability to predict the camera from which a pedestrian image originates, thus enhancing the model’s capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. In particular, on the challenging MSMT17 dataset, we achieve 58.9% mAP, surpassing state-of-the-art methods by 7.6%. Code is available at https://gitee.com/swjtugx/classmate/tree/master/OurGroup/CCAFL.
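
To make the two modules in the abstract more concrete, the following Python sketch shows one way the ideas could be written down. It is not the authors' implementation: the abstract gives no loss formulas, so the batch-hard triplet form used for the ICDL-style term, the negated cross-entropy used for the ICAL-style term, the margin value, and every function name here are assumptions made for illustration. Identity labels are treated as local to each camera, as the ICS setting requires.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_camera_hard_triplet_loss(features, labels, cameras, margin=0.3):
    """ICDL-style term (assumed form): batch-hard triplet loss restricted to one camera.

    For each anchor, only samples from the anchor's own camera are considered.
    The hardest positive is the farthest same-identity sample and the hardest
    negative is the closest different-identity sample. Anchors that lack a
    positive or a negative inside their camera are skipped.
    """
    losses = []
    for i in range(len(features)):
        same_cam = [j for j in range(len(features))
                    if j != i and cameras[j] == cameras[i]]
        pos = [euclidean(features[i], features[j])
               for j in same_cam if labels[j] == labels[i]]
        neg = [euclidean(features[i], features[j])
               for j in same_cam if labels[j] != labels[i]]
        if not pos or not neg:
            continue
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def camera_cross_entropy(logits, camera_id):
    """Loss a camera classifier would minimize: -log p(true camera)."""
    return -math.log(softmax(logits)[camera_id])


def adversarial_camera_loss(batch_logits, camera_ids):
    """ICAL-style term (assumed form) seen from the feature encoder.

    It is the negated mean camera cross-entropy, so it is lower when the
    camera is harder to predict from the features.
    """
    ce = [camera_cross_entropy(l, c) for l, c in zip(batch_logits, camera_ids)]
    return -sum(ce) / len(ce)


if __name__ == "__main__":
    # Toy batch: two cameras, identity labels are local to each camera.
    feats = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0],
             [0.0, 0.0], [0.0, 2.0], [0.0, 5.0]]
    ids = [0, 0, 1, 0, 0, 1]
    cams = [0, 0, 0, 1, 1, 1]
    print("intra-camera triplet loss:",
          intra_camera_hard_triplet_loss(feats, ids, cams))
    print("adversarial loss, uninformative logits:",
          adversarial_camera_loss([[0.0, 0.0, 0.0]], [0]))
    print("adversarial loss, confident logits:",
          adversarial_camera_loss([[10.0, 0.0, 0.0]], [0]))
```

In a training loop built on this sketch, the camera classifier would minimize camera_cross_entropy while the feature encoder minimizes adversarial_camera_loss. A gradient reversal layer is a common way to obtain the same effect in a single backward pass; whether CCAFL uses one is not stated in the abstract.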

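Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months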

About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.