Dynamic sparse and weight allocation-based text-driven person retrieval
Authors: Shuren Zhou, Qihang Zhou, Jiao Liu
Journal: Image and Vision Computing, volume 163, Article 105737 (impact factor 4.2; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2025-09-24
DOI: 10.1016/j.imavis.2025.105737
URL: https://www.sciencedirect.com/science/article/pii/S0262885625003257
Citation count: 0
Abstract
Text-to-image person retrieval aims to find the person images in a large-scale person dataset that best match a textual description. However, most existing methods share two problems: (1) the retrieval system still contains inaccurately matched pairs, and their errors degrade the performance of the whole system; (2) the full text is used directly throughout training, even though parts of it contribute little to recognizing the image, and how to process the text effectively remains an active research topic. These issues significantly degrade retrieval performance. To this end, we propose a new alignment optimization framework for text-based person retrieval. Specifically, the framework consists of three key components: (1) progressive enhancement for multimodal integration, which not only simulates coarse-grained alignment through mathematical modeling but also combines coarse-grained and fine-grained alignment through progressive learning; (2) global bidirectional match filtering, which uses subjective logic to mitigate the interference of incorrectly matched image-text pairs and applies a bidirectional KL match filtering algorithm to select highly matched image-text pairs for training; (3) fine-grained dynamic sparse mask modeling, which uses masked language modeling and a dynamic spatial sparsification module that applies more expressive modules to important positions and mines the relationship between image-text pairs at a fine-grained level, thus improving retrieval performance. Extensive experiments show that the method achieves state-of-the-art results on three benchmark datasets and performs well on domain generalization tasks.
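The abstract does not spell out the bidirectional KL match filtering algorithm, so the following is only an illustrative sketch of one way such a filter could work, not the authors' method. It assumes a square batch similarity matrix `sim` in which `sim[i][j]` scores image `i` against text `j`, and it keeps pair `i` when the image-to-text and text-to-image softmax distributions for that pair agree, measured by a symmetric KL divergence. The function names, the `temperature` value, and the `threshold` are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl_filter(sim, threshold, temperature=0.1):
    """Return indices of pairs whose two retrieval directions agree.

    Assumption (not from the paper): pair i is treated as reliable when
    the image-to-text distribution (row i) and the text-to-image
    distribution (column i) have a symmetric KL divergence at or below
    `threshold`.
    """
    n = len(sim)
    kept = []
    for i in range(n):
        image_to_text = softmax(sim[i], temperature)
        text_to_image = softmax([sim[j][i] for j in range(n)], temperature)
        score = (kl_divergence(image_to_text, text_to_image)
                 + kl_divergence(text_to_image, image_to_text))
        if score <= threshold:
            kept.append(i)
    return kept


# Toy batch: pair 1 is consistent in both directions; pairs 0 and 2 are not,
# because image 2 scores highly against text 0.
toy_sim = [
    [0.9, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.1, 0.2],
]
kept_pairs = bidirectional_kl_filter(toy_sim, threshold=0.5)
```

In a training loop, the indices in `kept_pairs` would select the image-text pairs that contribute to the alignment loss. The subjective-logic component mentioned in the abstract is a separate mechanism and is not modeled here.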
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.