Chenghuan Qi; Xi Yang; Nannan Wang; Xinbo Gao
IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5745-5757, published 2025-03-29. DOI: 10.1109/TIFS.2025.3574970
Granularity-Aware Hyperbolic Representation for Text-Based Person Search
Text-based person search aims to identify a specific target person in a database according to a given text description. Early work adopted separately pretrained encoders to extract visual and textual features; benefiting from the boom of vision-language pre-training, recent work instead uses a unified pretrained vision-language model such as CLIP as the backbone. However, vision-language models are generally pretrained on coarse-grained image-text pairs, whereas the image-text pairs in text-based person search must be fine-grained enough to distinguish different persons. In addition, visual and linguistic concepts naturally organize themselves into a hierarchy, which is not explicitly captured by current large-scale vision-language models such as CLIP. To bridge this gap, we propose a novel Granularity-Aware Hyperbolic Representation learning method for mining granularity and capturing semantic hierarchy. Notably, we consider both token-level and instance-level granularity. For token-granularity alignment, we present a Bidirectional Attention Interaction module that explicitly learns the matching between fine-grained visual tokens and text tokens. For instance-granularity alignment, we equip the contrastive learning loss with a Semantic Margin Softmax so that image-text pairs can perceive the similarity granularity of different samples during training. In addition, the global features of images and texts are mapped into hyperbolic space through Hyperbolic Representation Learning, which embeds tree-like data to capture semantic hierarchy. Extensive experiments verify the effectiveness of the proposed modules and show that our method achieves state-of-the-art results on three widely acknowledged benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReID. Our code is available at https://github.com/7chQ/GAHR
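The hyperbolic mapping described above can be illustrated with the standard Poincaré-ball formulation: Euclidean global features are projected into hyperbolic space via the exponential map at the origin, and similarity is measured by the geodesic distance. The sketch below is a generic NumPy illustration of this technique under an assumed curvature parameter `c`, not the paper's exact implementation; function names are illustrative.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    projects a Euclidean feature vector into the open ball of radius 1/sqrt(c)."""
    sqrt_c = np.sqrt(c)
    norm = max(np.linalg.norm(v), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c=1.0):
    """Mobius addition, the hyperbolic analogue of vector addition."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between two points in the Poincare ball."""
    sqrt_c = np.sqrt(c)
    diff_norm = np.linalg.norm(mobius_add(-x, y, c))
    diff_norm = min(diff_norm, (1 - eps) / sqrt_c)  # numerical safety: stay inside the ball
    return (2 / sqrt_c) * np.arctanh(sqrt_c * diff_norm)

# Illustrative usage: map two Euclidean feature vectors and compare them.
img_feat = expmap0(np.array([3.0, 4.0]))
txt_feat = expmap0(np.array([0.1, 0.2]))
d = poincare_dist(img_feat, txt_feat)
```

One appeal of this geometry for person search is that distances grow rapidly near the ball's boundary, so tree-like (coarse-to-fine) semantic hierarchies can be embedded with low distortion.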
Journal overview:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.