Dynamic sparse and weight allocation-based text-driven person retrieval

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-09-24 DOI:10.1016/j.imavis.2025.105737

Shuren Zhou , Qihang Zhou , Jiao Liu

{"title":"Dynamic sparse and weight allocation-based text-driven person retrieval","authors":"Shuren Zhou , Qihang Zhou , Jiao Liu","doi":"10.1016/j.imavis.2025.105737","DOIUrl":null,"url":null,"abstract":"<div><div>Text-to-image person retrieval aims to find the most matching personimages in a large-scale persondataset through textual descriptions. However, most of the existing methods have the following problems: (1) There are still some inaccurate matching pairs in the retrieval system, and the errors of these matching pairs negatively affect the performance of the whole retrieval system. (2) In the whole training process of the model, the whole text is used directly, but there are still non-important parts of the text that are not important for recognizing the images, and how to process the text effectively is still a hot topic in current research. These critical issues significantly degrade the retrieval performance. To this end, we propose a new alignment optimization framework for text-based person retrieval. Precisely, our framework consists of three key components: (1) progressive enhancement for a multimodal integration, which not only simulates coarse-grained alignment through mathematical modeling, but also appropriately combines coarse-grained and fine-grained alignment through progressive learning; (2) global bidirectional match filtering, which utilizes subjective logic to effectively mitigate the interference of incorrectly matched pairs of image text, and at the same time utilizes a bidirectional KL match filtering algorithm so as to select the matching pairs with high degree of image text matching for training; (3) fine-grained dynamic sparse mask modeling, which uses mask language modeling and constructs a dynamic spatial sparsification module, which not only applies more expressive modules to important positions but also mines the relationship between image text pairs at a fine-grained level, thus improving retrieval performance. Extensive experiments show that the method achieves state-of-the-art results on three benchmark datasets and performs well on domain generalization tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105737"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003257","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Text-to-image person retrieval aims to find the most matching personimages in a large-scale persondataset through textual descriptions. However, most of the existing methods have the following problems: (1) There are still some inaccurate matching pairs in the retrieval system, and the errors of these matching pairs negatively affect the performance of the whole retrieval system. (2) In the whole training process of the model, the whole text is used directly, but there are still non-important parts of the text that are not important for recognizing the images, and how to process the text effectively is still a hot topic in current research. These critical issues significantly degrade the retrieval performance. To this end, we propose a new alignment optimization framework for text-based person retrieval. Precisely, our framework consists of three key components: (1) progressive enhancement for a multimodal integration, which not only simulates coarse-grained alignment through mathematical modeling, but also appropriately combines coarse-grained and fine-grained alignment through progressive learning; (2) global bidirectional match filtering, which utilizes subjective logic to effectively mitigate the interference of incorrectly matched pairs of image text, and at the same time utilizes a bidirectional KL match filtering algorithm so as to select the matching pairs with high degree of image text matching for training; (3) fine-grained dynamic sparse mask modeling, which uses mask language modeling and constructs a dynamic spatial sparsification module, which not only applies more expressive modules to important positions but also mines the relationship between image text pairs at a fine-grained level, thus improving retrieval performance. Extensive experiments show that the method achieves state-of-the-art results on three benchmark datasets and performs well on domain generalization tasks.

Abstract Image

查看原文本刊更多论文

基于动态稀疏和权值分配的文本驱动人物检索

文本到图像的人物检索旨在通过文本描述在大规模的人物数据集中找到最匹配的人物图像。然而，现有的大多数方法都存在以下问题：(1)检索系统中仍然存在一些不准确的匹配对，这些匹配对的误差会对整个检索系统的性能产生负面影响。(2)在模型的整个训练过程中，直接使用了整个文本，但仍然存在文本中不重要的部分，这些部分对图像识别不重要，如何有效地处理文本仍然是当前研究的热点。这些关键问题显著降低了检索性能。为此，我们提出了一种新的基于文本的人物检索对齐优化框架。准确地说，我们的框架由三个关键部分组成：(1)对多模态集成的渐进增强，它不仅通过数学建模模拟粗粒度对齐，而且通过渐进学习将粗粒度和细粒度对齐适当地结合起来；(2)全局双向匹配滤波，利用主观逻辑有效减轻图像文本不正确匹配对的干扰，同时利用双向KL匹配滤波算法，选择图像文本匹配程度高的匹配对进行训练；(3)细粒度动态稀疏掩码建模，利用掩码语言建模，构建动态空间稀疏化模块，不仅在重要位置应用更多表达模块，而且在细粒度层面挖掘图像文本对之间的关系，从而提高检索性能。大量的实验表明，该方法在三个基准数据集上取得了最先进的结果，并且在领域泛化任务上表现良好。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.