CLEAR: Cross-Transformers With Pre-Trained Language Model for Person Attribute Recognition and Retrieval
Doanh C. Bui, Thinh V. Le, Ba Hung Ngo, Tae Jong Choi
Pattern Recognition, volume 164, Article 111486. Published 2025-03-06. Impact factor: 7.5. JCR: Q1 (Computer Science, Artificial Intelligence).
DOI: 10.1016/j.patcog.2025.111486
URL: https://www.sciencedirect.com/science/article/pii/S0031320325001463
Citations: 0
Abstract
Person attribute recognition and attribute-based person retrieval are two core human-centric tasks. In the recognition task, the challenge lies in identifying attributes based on a person's appearance, while the retrieval task involves searching for matching persons using attribute-based queries. In this paper, we present CLEAR, a unified network designed to address both tasks. We leverage our C²T-Net, a strong Cross-Transformers backbone that achieved state-of-the-art performance in the person attribute recognition task during the UPAR Challenge 2024, to extract visual embeddings. To extend its capabilities to the attribute-based person retrieval task, we construct pseudo-textual descriptions for attribute queries, leverage a pretrained language model to generate language-rich feature embeddings, and introduce an effective training strategy that finetunes only a few additional parameters, in the form of adapters, to produce visual and query embeddings within the retrieval space. As the visual embeddings extracted by C²T-Net are highly discriminative, they align well with the proposed query embeddings during the finetuning process, facilitating improved retrieval performance. The unified CLEAR model is evaluated on five benchmarks: PETA, PA100K, Market-1501, RAPv2, and UPAR2024, achieving state-of-the-art or competitive results for both tasks. Notably, it ranks as the top performer on the large-scale UPAR2024 dataset, which is specifically designed to test domain generalizability in real-world scenarios where test samples differ from training samples.
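The retrieval idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute vocabulary, the sentence template, and the helper names (`build_pseudo_description`, `cosine`, `rank_gallery`) are all assumptions made for illustration, and the embeddings are toy vectors standing in for the outputs of the language model and the visual backbone. The sketch shows only two steps: turning a binary attribute query into a pseudo-textual description, and ranking gallery embeddings by cosine similarity in a shared retrieval space.

```python
import math

# Hypothetical attribute vocabulary; the abstract does not list the actual
# attribute names or the template used to build descriptions.
ATTRIBUTES = ["female", "backpack", "hat", "long sleeves", "short hair"]


def build_pseudo_description(query, names=ATTRIBUTES):
    """Turn a binary attribute query into a sentence a language model could encode."""
    present = [name for name, flag in zip(names, query) if flag]
    if not present:
        return "A person with no listed attributes."
    return "A person with the following attributes: " + ", ".join(present) + "."


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins: in practice the query embedding would come from a
    # language model and the gallery embeddings from the visual backbone.
    print(build_pseudo_description([1, 0, 1, 0, 0]))
    print(rank_gallery([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]))
```

In a full system, the description string would be encoded by the pretrained language model, and both query and visual embeddings would pass through the trained adapters before the similarity step; those components are omitted here because the abstract does not specify them.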
About the journal:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.