Fine-grained semantic oriented embedding set alignment for text-based person search

Jiaqi Zhao, Ao Fu, Yong Zhou, Wen-liang Du, Rui Yao

Image and Vision Computing, Volume 152, Article 105309. Published 2024-11-05. DOI: 10.1016/j.imavis.2024.105309
Text-based person search aims to retrieve images of a person that are highly semantically relevant to a given textual description. The main difficulties of this retrieval task are modality heterogeneity and fine-grained matching. Most existing methods only consider alignment of global features and ignore the fine-grained matching problem. Cross-modal attention interactions between image patches and text tokens are a popular way to perform direct alignment. However, cross-modal attention incurs a large overhead at the inference stage, which makes it impractical in real-world scenarios. Moreover, patch-token alignment itself is questionable, since individual image patches and text tokens do not carry complete semantic information. This paper proposes an Embedding Set Alignment (ESA) module for fine-grained alignment. The module preserves fine-grained semantic information by merging token-level features into embedding sets. The ESA module benefits from pre-trained cross-modal large models, can be attached to the backbone non-intrusively, and is trained in an end-to-end manner. In addition, an Adaptive Semantic Margin (ASM) loss is designed to measure the alignment of embedding sets, instead of adopting a loss function with a fixed margin. Extensive experiments demonstrate that the proposed fine-grained semantic-oriented embedding set alignment method achieves state-of-the-art performance on three popular benchmark datasets, surpassing the previous best methods.
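The abstract does not spell out how the ESA module forms embedding sets or how the ASM margin adapts, so the sketch below is only a minimal illustration of the two ideas: token-level features are pooled into a small number of set-level embeddings, and a triplet-style matching loss uses a margin that shrinks as the positive pair becomes better aligned. All names (merge_tokens_to_embedding_sets, adaptive_margin_loss) and the specific pooling and margin formulas are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: the real ESA/ASM formulations are defined in the paper,
# not in this abstract. Function names and formulas here are assumptions.
import torch
import torch.nn.functional as F


def merge_tokens_to_embedding_sets(token_feats: torch.Tensor, num_sets: int) -> torch.Tensor:
    """Pool token-level features into `num_sets` set-level embeddings.

    token_feats: (batch, num_tokens, dim) features from a pre-trained cross-modal backbone.
    Assumes num_tokens divides evenly into num_sets groups; a learned grouping
    could replace this simple chunked pooling.
    """
    b, n, d = token_feats.shape
    assert n % num_sets == 0, "for this sketch, tokens must split evenly into sets"
    sets = token_feats.view(b, num_sets, n // num_sets, d).mean(dim=2)  # average-pool each group
    return F.normalize(sets, dim=-1)  # unit norm so dot products act as cosine similarities


def adaptive_margin_loss(img_sets: torch.Tensor, txt_sets: torch.Tensor,
                         base_margin: float = 0.2) -> torch.Tensor:
    """Hinge-style matching loss whose margin adapts to the positive pair (ASM stand-in).

    img_sets, txt_sets: (batch, num_sets, dim); row i of each tensor is a matched pair.
    """
    # Set-to-set similarity: match each image set to its best text set, then average
    # (one simple aggregation choice among many).
    sim = torch.einsum('bnd,cmd->bcnm', img_sets, txt_sets)   # (B, B, N_img, N_txt)
    sim = sim.max(dim=-1).values.mean(dim=-1)                 # (B, B) pairwise similarity

    pos = sim.diag()                                          # similarity of matched pairs
    # Assumed adaptation rule: shrink the margin when the positive pair is already close.
    margin = base_margin * (1.0 - pos.detach()).clamp(min=0.0)

    off_diag = sim.masked_fill(
        torch.eye(sim.size(0), dtype=torch.bool, device=sim.device), float('-inf'))
    neg = off_diag.max(dim=1).values                          # hardest negative text per image
    return F.relu(margin + neg - pos).mean()


if __name__ == "__main__":
    # Toy shapes: 4 image-text pairs, 32 tokens each, 256-dim features, 4 embedding sets.
    img_tokens = torch.randn(4, 32, 256)
    txt_tokens = torch.randn(4, 32, 256)
    img_sets = merge_tokens_to_embedding_sets(img_tokens, num_sets=4)
    txt_sets = merge_tokens_to_embedding_sets(txt_tokens, num_sets=4)
    print(adaptive_margin_loss(img_sets, txt_sets))
```

Because the set embeddings are computed once per image and per caption, this kind of alignment keeps retrieval cost close to a global-feature dot product, which is the efficiency argument the abstract makes against cross-modal attention at inference time.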
Journal description:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.