{"title":"Automated text annotation: a new paradigm for generalizable text-to-image person retrieval","authors":"Delong Liu, Peng Wang, Zhicheng Zhao, Fei Su","doi":"10.1007/s10489-025-06487-1","DOIUrl":null,"url":null,"abstract":"<p>Retrieving specific person images based on textual descriptions, known as Text-to-Image Person Retrieval (TIPR), has emerged as a challenging research problem. While existing methods primarily focus on architectural refinements and feature representation enhancements, the critical aspect of textual description quality remains understudied. We propose a novel framework that automatically generates stylistically consistent textual descriptions to enhance TIPR generalizability. Specifically, we develop a dual-model architecture employing both captioning and retrieval models to quantitatively evaluate the impact of textual descriptions on retrieval performance. Comparative analysis reveals that manually annotated descriptions exhibit significant stylistic variations due to subjective biases among different annotators. To address this, our framework utilizes the captioning model to generate structurally consistent textual descriptions, enabling subsequent training and inference of the retrieval model based on automated annotations. Notably, our framework achieves a <b>18.60%</b> improvement in Rank-1 accuracy over manual annotations on the RSTPReid dataset. We systematically investigate the impact of identity quantity during testing and explore prompt-guided strategy to enhance image caption quality. Furthermore, this paradigm ensures superior generalization capabilities for well-trained retrieval models. Extensive experiments demonstrate that our approach improves the applicability of TIPR systems.</p><p>Comparison framework of manual and automated annotation performance. The left panel illustrates the process of generating automated annotations and the details of captioner training and testing. The right panel demonstrates the training and testing processes using different image-text pairs and compares the final results on the RSTPReid dataset. This results show that the performance of automated annotations surpasses that of manual annotations on this dataset</p>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 7","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-025-06487-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Retrieving specific person images based on textual descriptions, known as Text-to-Image Person Retrieval (TIPR), has emerged as a challenging research problem. While existing methods primarily focus on architectural refinements and feature representation enhancements, the critical aspect of textual description quality remains understudied. We propose a novel framework that automatically generates stylistically consistent textual descriptions to enhance TIPR generalizability. Specifically, we develop a dual-model architecture employing both captioning and retrieval models to quantitatively evaluate the impact of textual descriptions on retrieval performance. Comparative analysis reveals that manually annotated descriptions exhibit significant stylistic variations due to subjective biases among different annotators. To address this, our framework utilizes the captioning model to generate structurally consistent textual descriptions, enabling subsequent training and inference of the retrieval model based on automated annotations. Notably, our framework achieves an 18.60% improvement in Rank-1 accuracy over manual annotations on the RSTPReid dataset. We systematically investigate the impact of the number of identities during testing and explore a prompt-guided strategy to enhance image caption quality. Furthermore, this paradigm ensures superior generalization capabilities for well-trained retrieval models. Extensive experiments demonstrate that our approach improves the applicability of TIPR systems.
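As a rough illustration of the automated-annotation step described above, the sketch below captions a person image with an off-the-shelf captioning model, including a prompt-guided variant. The paper does not specify its captioner or prompts here, so BLIP, the file name, and the example prompt are stand-in assumptions.

```python
# Minimal sketch: automated caption generation for a person image.
# BLIP is used as a stand-in captioner; the paper's actual model may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("person.jpg").convert("RGB")  # hypothetical input image

# Unconditional captioning: the model describes the image freely.
inputs = processor(images=image, return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=60)
caption = processor.decode(ids[0], skip_special_tokens=True)

# Prompt-guided captioning: a textual prefix steers the description
# toward person-relevant attributes (clothing, accessories, etc.).
prompt = "a photo of a person wearing"  # illustrative prompt, not from the paper
inputs = processor(images=image, text=prompt, return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=60)
guided_caption = processor.decode(ids[0], skip_special_tokens=True)

print(caption)
print(guided_caption)
```

Captions generated this way share one model's phrasing and structure, which is the stylistic consistency the framework relies on when training the retrieval model.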
Comparison framework of manual and automated annotation performance. The left panel illustrates the process of generating automated annotations and the details of captioner training and testing. The right panel demonstrates the training and testing processes using different image-text pairs and compares the final results on the RSTPReid dataset. These results show that automated annotations outperform manual annotations on this dataset.
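On the evaluation side, the Rank-1 accuracy reported above counts a query as correct when the top-ranked gallery image shares the query's identity. A minimal sketch, assuming precomputed text and image embeddings from a retrieval model (the function and variable names are hypothetical):

```python
import torch
import torch.nn.functional as F

def rank1_accuracy(text_emb, image_emb, text_pids, image_pids):
    """Rank-1 accuracy for text-to-image person retrieval.

    text_emb:   (Q, D) query text embeddings
    image_emb:  (G, D) gallery image embeddings
    text_pids / image_pids: identity labels for queries / gallery images
    """
    # Cosine similarity between every query and every gallery image.
    sims = F.normalize(text_emb, dim=1) @ F.normalize(image_emb, dim=1).t()
    top1 = sims.argmax(dim=1)             # index of the best match per query
    hits = image_pids[top1] == text_pids  # correct if identities agree
    return hits.float().mean().item()

# Toy usage with random embeddings (4 queries, 10 gallery images, 5 identities).
text_emb = torch.randn(4, 256)
image_emb = torch.randn(10, 256)
text_pids = torch.randint(0, 5, (4,))
image_pids = torch.randint(0, 5, (10,))
print(rank1_accuracy(text_emb, image_emb, text_pids, image_pids))
```

The same routine can score a retrieval model trained on manual captions against one trained on automated captions, which is the comparison the figure summarizes.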
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses real-life manufacturing, defense, management, government, and industrial problems that are too complex to be solved through conventional approaches and instead require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches to solving complex problems is of particular importance.
The journal presents new and original research and technological developments that address real, complex problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.