基于文本的人物搜索的细粒度语义导向嵌入集对齐

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2024-11-05 DOI:10.1016/j.imavis.2024.105309

Jiaqi Zhao , Ao Fu , Yong Zhou , Wen-liang Du , Rui Yao

{"title":"基于文本的人物搜索的细粒度语义导向嵌入集对齐","authors":"Jiaqi Zhao , Ao Fu , Yong Zhou , Wen-liang Du , Rui Yao","doi":"10.1016/j.imavis.2024.105309","DOIUrl":null,"url":null,"abstract":"<div><div>Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The difficulty of this retrieval task is modality heterogeneity and fine-grained matching. Most existing methods only consider the alignment using global features, ignoring the fine-grained matching problem. The cross-modal attention interactions are popularly used for image patches and text markers for direct alignment. However, cross-modal attention may cause a huge overhead in the reasoning stage and cannot be applied in actual scenarios. In addition, it is unreasonable to perform patch-token alignment, since image patches and text tokens do not have complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module can preserve fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, and it can be combined with the backbone non-intrusively and trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to describe the alignment of embedding sets, instead of adapting a loss function with a fixed margin. Extensive experiments demonstrate that our proposed fine-grained semantic embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"152 ","pages":"Article 105309"},"PeriodicalIF":4.2000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fine-grained semantic oriented embedding set alignment for text-based person search\",\"authors\":\"Jiaqi Zhao , Ao Fu , Yong Zhou , Wen-liang Du , Rui Yao\",\"doi\":\"10.1016/j.imavis.2024.105309\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The difficulty of this retrieval task is modality heterogeneity and fine-grained matching. Most existing methods only consider the alignment using global features, ignoring the fine-grained matching problem. The cross-modal attention interactions are popularly used for image patches and text markers for direct alignment. However, cross-modal attention may cause a huge overhead in the reasoning stage and cannot be applied in actual scenarios. In addition, it is unreasonable to perform patch-token alignment, since image patches and text tokens do not have complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module can preserve fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, and it can be combined with the backbone non-intrusively and trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to describe the alignment of embedding sets, instead of adapting a loss function with a fixed margin. Extensive experiments demonstrate that our proposed fine-grained semantic embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"152 \",\"pages\":\"Article 105309\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-11-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885624004141\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624004141","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

基于文本的人物搜索旨在检索与给定文本描述语义高度相关的人物图像。这种检索任务的难点在于模态异质性和细粒度匹配。大多数现有方法只考虑使用全局特征进行配准，而忽略了细粒度匹配问题。跨模态注意力交互通常用于图像补丁和文本标记的直接配准。然而，跨模态注意力可能会在推理阶段造成巨大的开销，无法应用于实际场景。此外，由于图像补丁和文本标记不具备完整的语义信息，因此进行补丁-标记对齐是不合理的。本文提出了一种用于细粒度对齐的嵌入集对齐（ESA）模块。该模块可通过将标记级特征合并到嵌入集来保留细粒度语义信息。ESA 模块得益于预先训练好的跨模态大型模型，它可以与骨干网非侵入式结合，并以端到端的方式进行训练。此外，我们还设计了自适应语义边际（ASM）损失来描述嵌入集的对齐情况，而不是采用具有固定边际的损失函数。广泛的实验证明，我们提出的细粒度语义嵌入集对齐方法在三个流行的基准数据集上取得了一流的性能，超过了以前的最佳方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Fine-grained semantic oriented embedding set alignment for text-based person search

Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The difficulty of this retrieval task is modality heterogeneity and fine-grained matching. Most existing methods only consider the alignment using global features, ignoring the fine-grained matching problem. The cross-modal attention interactions are popularly used for image patches and text markers for direct alignment. However, cross-modal attention may cause a huge overhead in the reasoning stage and cannot be applied in actual scenarios. In addition, it is unreasonable to perform patch-token alignment, since image patches and text tokens do not have complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module can preserve fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, and it can be combined with the backbone non-intrusively and trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to describe the alignment of embedding sets, instead of adapting a loss function with a fixed margin. Extensive experiments demonstrate that our proposed fine-grained semantic embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.