Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval

IF 6 · Zone 1, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Networks · Pub Date: 2024-12-16 · DOI: 10.1016/j.neunet.2024.107028

Delong Liu, Haiwen Li, Zhicheng Zhao, Yuan Dong

Neural Networks, Volume 184, Article 107028. https://www.sciencedirect.com/science/article/pii/S0893608024009572

Citations: 0

Abstract

Text-to-Image Person Retrieval (TIPR) aims to retrieve images of a specific person from a given textual description. A primary challenge is bridging the substantial representational gap between the visual and textual modalities. Prevailing methods map texts and images into a unified embedding space for matching, but they still fail to model the intricate semantic correspondences between texts and images effectively. To address this issue, we propose a novel TIPR framework that builds fine-grained interactions and alignment between person images and their corresponding texts. First, a visual–textual dual encoder is constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, which preliminarily aligns image and text features. Second, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. In addition, a cross-modal triplet loss is introduced to handle hard samples and further enhance the model’s ability to discriminate minor differences. Finally, a pruning-based text data augmentation approach is proposed to focus the model on the essential elements of descriptions and keep it from attending excessively to less significant information. Experimental results show that the proposed method outperforms state-of-the-art methods on three popular benchmark datasets. The code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
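
To make the idea of a cross-modal triplet loss with hard samples concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact formulation, so the cosine similarity, the hardest-negative selection within a batch, the margin value, and the function names `cosine` and `cross_modal_triplet_loss` are all assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def cross_modal_triplet_loss(
    image_embs: Sequence[Vector],
    text_embs: Sequence[Vector],
    ids: Sequence[int],
    margin: float = 0.2,  # assumed value, not taken from the paper
) -> float:
    """Hinge-style triplet loss over a batch of paired image/text embeddings.

    image_embs[i] and text_embs[i] describe the same person, whose identity
    label is ids[i]. For each pair, the hardest negative is the most similar
    item of the other modality that belongs to a different identity.
    """
    n = len(ids)
    if n == 0:
        return 0.0
    # sim[i][j] = similarity between image i and text j
    sim = [[cosine(image_embs[i], text_embs[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Image-to-text direction: other-identity texts compared with image i.
        neg_texts = [sim[i][j] for j in range(n) if ids[j] != ids[i]]
        # Text-to-image direction: other-identity images compared with text i.
        neg_images = [sim[j][i] for j in range(n) if ids[j] != ids[i]]
        if neg_texts:
            total += max(0.0, margin - positive + max(neg_texts))
        if neg_images:
            total += max(0.0, margin - positive + max(neg_images))
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    # Perfectly aligned pairs with distinct identities incur no loss.
    print(cross_modal_triplet_loss(images, texts, ids=[0, 1]))
```

Selecting only the hardest negative per anchor concentrates the gradient on the pairs most likely to be confused, which is one common way to target minor differences between similar-looking people; other mining strategies would fit the same description.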

Source journal

Neural Networks (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

13.90

Self-citation rate

7.70%

Articles published

425

Review time

67 days

About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. The journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, it aims to encourage the development of biologically inspired artificial intelligence.