Chenghuan Qi; Xi Yang; Nannan Wang; Xinbo Gao
IEEE Transactions on Information Forensics and Security, vol. 20, pp. 5745-5757, published 2025-03-29. DOI: 10.1109/TIFS.2025.3574970
Granularity-Aware Hyperbolic Representation for Text-Based Person Search
Text-based person search aims to identify a specific target person in a database according to a given text description. Early work adopted separately pretrained encoders to extract visual and textual features; benefiting from the boom of vision-language pre-training, recent work instead uses a unified pretrained vision-language model such as CLIP as the backbone. However, vision-language models are generally pretrained on coarse-grained image-text pairs, whereas the image-text pairs in text-based person search must be fine-grained enough to distinguish different persons. In addition, visual and linguistic concepts naturally organize themselves into a hierarchy, which is not explicitly captured by current large-scale vision-language models such as CLIP. To bridge this gap, we propose a novel Granularity-Aware Hyperbolic Representation learning method for mining granularity and capturing semantic hierarchy. Notably, we consider both token-level and instance-level granularity. For token-granularity alignment, we present a Bidirectional Attention Interaction module that explicitly learns the matching between fine-grained visual tokens and text tokens. For instance-granularity alignment, we equip the contrastive learning loss with a Semantic Margin Softmax so that image-text pairs can perceive the similarity granularity of different samples during training. In addition, the global features of images and texts are mapped into hyperbolic space through Hyperbolic Representation Learning, which embeds tree-like data to capture semantic hierarchy. Extensive experiments verify the effectiveness of the proposed modules and show that our method achieves state-of-the-art results on three widely acknowledged benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReID. Our code is available at https://github.com/7chQ/GAHR
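The hyperbolic mapping described above can be illustrated with the standard Poincaré-ball formulation: Euclidean global features are projected into hyperbolic space via the exponential map at the origin, and similarity is measured by the geodesic distance. The sketch below is a generic NumPy illustration of this technique under an assumed curvature parameter `c`, not the paper's exact implementation; function names are illustrative.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-7):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    projects a Euclidean feature vector into the open ball of radius 1/sqrt(c)."""
    sqrt_c = np.sqrt(c)
    norm = max(np.linalg.norm(v), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c=1.0):
    """Mobius addition, the hyperbolic analogue of vector addition."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def poincare_dist(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between two points in the Poincare ball."""
    sqrt_c = np.sqrt(c)
    diff_norm = np.linalg.norm(mobius_add(-x, y, c))
    diff_norm = min(diff_norm, (1 - eps) / sqrt_c)  # numerical safety: stay inside the ball
    return (2 / sqrt_c) * np.arctanh(sqrt_c * diff_norm)

# Illustrative usage: map two Euclidean feature vectors and compare them.
img_feat = expmap0(np.array([3.0, 4.0]))
txt_feat = expmap0(np.array([0.1, 0.2]))
d = poincare_dist(img_feat, txt_feat)
```

One appeal of this geometry for person search is that distances grow rapidly near the ball's boundary, so tree-like (coarse-to-fine) semantic hierarchies can be embedded with low distortion.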
Journal overview:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.