Title: Implicit Alignment-Based Cross-Modal Symbiotic Network for Text-to-Image Person Re-Identification
Authors: Rui Sun; Yun Du; Guoxi Huang; Xuebin Wang; Jingjing Wu
Journal: IEEE Transactions on Information Forensics and Security, vol. 20, pp. 8069-8082
DOI: 10.1109/TIFS.2025.3594558
Published: 2025-07-31 (Journal Article)
URL: https://ieeexplore.ieee.org/document/11105519/
Text-to-image person re-identification aims to utilize textual descriptions to retrieve specific person images from large image databases. The core challenge of this task lies in the significant feature differences between the abstract nature of text and the intuitiveness of images. Existing solutions primarily rely on explicit alignment of global or fine-grained local features, which lack flexibility and struggle to effectively capture and leverage subtle features and relationship information in multimodal data. Particularly, for different images of the same person, the emphasis in feature extraction should be adjusted according to the differences in text descriptions. To address these issues, this paper proposes a Cross-Modal Symbiotic Network (CMSN) based on implicit alignment. First, CMSN employs an Implicit Multi-scale Feature Integration (IMFI) module to implicitly extract and fuse multi-scale features from images and text, thereby adaptively capturing the feature relationships between the two modalities. Second, a Combined Representation Learning (CRL) module is used to produce a combined representation of the text and image features, utilizing a Combined-Representation Identity Alignment (CRIA) loss to align and constrain the identity centers of the three feature vectors. Finally, we design a Semi-Positive Triplet (SPT) loss function, which defines semi-positive samples using other images and texts of the same identity, providing additional supervisory information to the model and further reducing modality heterogeneity. Extensive experiments on the CUHK-PEDES dataset demonstrate that CMSN achieves an impressive Rank-1 and mAP accuracy of 76.46% and 70.28%, respectively, significantly outperforming existing SOTA methods.
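The abstract's Semi-Positive Triplet (SPT) loss can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so the function below is a hypothetical reading: alongside the standard triplet term, a semi-positive sample (another image or caption of the same identity) is pushed closer to the anchor than the negative, but under a relaxed margin (`semi_margin`). All names and margin values are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def semi_positive_triplet_loss(anchor, positive, semi_positive, negative,
                               margin=0.3, semi_margin=0.1):
    """Hypothetical sketch of a semi-positive triplet objective.

    Distances are Euclidean. The semi-positive term supplies extra
    supervision from other same-identity samples, with a smaller margin
    than the true positive's.
    """
    d = lambda a, b: float(np.linalg.norm(a - b))
    # Standard triplet term: positive must be closer than negative by `margin`.
    hard = max(0.0, margin + d(anchor, positive) - d(anchor, negative))
    # Semi-positive term: same structure, relaxed margin `semi_margin`.
    soft = max(0.0, semi_margin + d(anchor, semi_positive) - d(anchor, negative))
    return hard + soft
```

When the negative is far from the anchor, both hinge terms vanish and the loss is zero; as the negative approaches, the semi-positive term activates first because of its smaller margin, which is one plausible way such a loss could reduce modality heterogeneity without forcing semi-positives as close as true positives.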
Journal introduction:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, and surveillance, as well as systems applications that incorporate these features.