Prototypical Prompting for Text-to-image Person Re-identification
Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang
arXiv - CS - Multimedia, 2024-09-14, https://doi.org/arxiv-2409.09427
Abstract
In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to retrieve images of the identity described by a text sentence from a pool of candidate images. Benefiting from Vision-Language Pre-training models such as CLIP (Contrastive Language-Image Pre-training), TIReID methods have achieved remarkable progress recently. However, most existing methods focus only on instance-level matching and ignore identity-level matching, which involves associating the multiple images and texts belonging to the same person. In this paper, we propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by 'initialize, adapt, enrich, then aggregate'. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update the prototypes conditioned on intra-modal and inter-modal instances, ensuring prototype diversity. Finally, we design an adaptive prototype aggregation module that aggregates these prototypes into the final identity-enriched prototypes. With the identity-enriched prototypes, we diffuse their rich identity information to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot over existing TIReID methods.
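To make the prototype-to-instance contrastive loss concrete, the sketch below shows one common way such a loss is formulated: each instance embedding is pulled toward the prototype of its own identity and pushed away from the prototypes of all other identities via a temperature-scaled softmax. This is an illustrative assumption based on the abstract, not the authors' exact formulation; the function name, temperature value, and normalization choices are hypothetical.

import torch
import torch.nn.functional as F

def prototype_to_instance_loss(instance_emb, prototypes, identity_labels, tau=0.07):
    """Hedged sketch of a prototype-to-instance contrastive loss.

    instance_emb:    (B, D) image or text instance embeddings
    prototypes:      (N_id, D) identity-enriched prototypes, one per identity
    identity_labels: (B,) identity index of each instance
    tau:             softmax temperature (illustrative value)
    """
    # Cosine similarity between every instance and every identity prototype.
    instance_emb = F.normalize(instance_emb, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = instance_emb @ prototypes.t() / tau  # (B, N_id)
    # Cross-entropy treats the matching identity prototype as the positive
    # and all other prototypes as negatives.
    return F.cross_entropy(logits, identity_labels)

In this reading, the loss diffuses the identity information accumulated in each prototype back to the individual image and text instances, which is how identity-level matching would complement the usual instance-level (image-text pair) objective.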