Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval

Impact factor 8.4 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Information Systems)

IEEE Transactions on Multimedia Pub Date : 2024-06-05 DOI:10.1109/TMM.2024.3410129

Jiayi Li;Min Jiang;Jun Kong;Xuefeng Tao;Xi Luo

IEEE Transactions on Multimedia, vol. 26, pp. 10678-10691. Full text: https://ieeexplore.ieee.org/document/10549861/

Citations: 0

Abstract

Text-Based Person Retrieval (TBPR) aims to identify a particular individual within an extensive image gallery using text as the query. The principal challenge in the TBPR task is how to map cross-modal information to a potential common space and learn a generic representation. Previous methods have primarily focused on aligning single text-image pairs, disregarding the inherent polymorphism within both images and natural language expressions for the same individual. Moreover, these methods have ignored the impact of semantic polymorphism-based intra-modal data distribution on cross-modal matching. Recent methods employ cross-modal implicit information reconstruction to enhance inter-modal connections. However, the process of information reconstruction remains ambiguous. To address these issues, we propose the Learning Semantic Polymorphic Mapping (LSPM) framework, facilitated by the prowess of pre-trained cross-modal models. Firstly, to learn cross-modal information representations with better robustness, we design the Inter-modal Information Aggregation (Inter-IA) module to achieve cross-modal polymorphic mapping, fortifying the foundation of our information representations. Secondly, to attain a more concentrated intra-modal information representation based on semantic polymorphism, we design the Intra-modal Information Aggregation (Intra-IA) module to further constrain the embeddings. Thirdly, to further explore the potential of cross-modal interactions within the model, we design the implicit reasoning module, Masked Information Guided Reconstruction (MIGR), with constraint guidance to elevate overall performance. Extensive experiments on both the CUHK-PEDES and ICFG-PEDES datasets show that we achieve state-of-the-art results on Rank-1, mAP, and mINP compared to existing methods.
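
The abstract reports Rank-1, mAP, and mINP, all of which are computed from a ranked gallery per text query. As a minimal sketch of how these metrics are commonly defined (not the authors' evaluation code), the function below takes a text-to-image similarity matrix and identity labels. The name `retrieval_metrics` and its arguments are hypothetical, and mINP is assumed to follow the usual "mean inverse negative penalty" definition: the number of correct matches divided by the rank of the last correct match, averaged over queries.

```python
from typing import Dict, List, Sequence


def retrieval_metrics(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
) -> Dict[str, float]:
    """Compute Rank-1, mAP and mINP from a text-to-image similarity matrix.

    similarity[i][j] is the score between text query i and gallery image j
    (higher means more similar). Queries with no matching image are skipped.
    This is an illustrative helper, not the paper's evaluation code.
    """
    rank1_hits: List[float] = []
    average_precisions: List[float] = []
    inverse_negative_penalties: List[float] = []

    for scores, qid in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        num_correct = sum(matches)
        if num_correct == 0:
            continue

        # Rank-1: is the top-ranked image the right person?
        rank1_hits.append(1.0 if matches[0] else 0.0)

        # AP: mean of precision@k taken at each rank k that holds a correct match.
        hits = 0
        precision_sum = 0.0
        hardest_rank = 0
        for position, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / position
                hardest_rank = position  # ends as the rank of the last correct match
        average_precisions.append(precision_sum / num_correct)

        # INP: correct matches divided by the rank of the hardest correct match.
        inverse_negative_penalties.append(num_correct / hardest_rank)

    n = len(rank1_hits)
    if n == 0:
        raise ValueError("no query has a matching gallery image")
    return {
        "rank1": sum(rank1_hits) / n,
        "mAP": sum(average_precisions) / n,
        "mINP": sum(inverse_negative_penalties) / n,
    }
```

In practice the similarity matrix would come from comparing text and image embeddings in the shared space (for example by cosine similarity); the metric computation itself is independent of how those embeddings are learned.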


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.