Title: Implicit Alignment-Based Cross-Modal Symbiotic Network for Text-to-Image Person Re-Identification
Authors: Rui Sun; Yun Du; Guoxi Huang; Xuebin Wang; Jingjing Wu
Journal: IEEE Transactions on Information Forensics and Security, vol. 20, pp. 8069-8082
DOI: 10.1109/TIFS.2025.3594558
Published: 2025-07-31 (Journal Article)
URL: https://ieeexplore.ieee.org/document/11105519/
Text-to-image person re-identification aims to utilize textual descriptions to retrieve specific person images from large image databases. The core challenge of this task lies in the significant feature differences between the abstract nature of text and the intuitiveness of images. Existing solutions primarily rely on explicit alignment of global or fine-grained local features, which lack flexibility and struggle to effectively capture and leverage subtle features and relationship information in multimodal data. Particularly, for different images of the same person, the emphasis in feature extraction should be adjusted according to the differences in text descriptions. To address these issues, this paper proposes a Cross-Modal Symbiotic Network (CMSN) based on implicit alignment. First, CMSN employs an Implicit Multi-scale Feature Integration (IMFI) module to implicitly extract and fuse multi-scale features from images and text, thereby adaptively capturing the feature relationships between the two modalities. Second, a Combined Representation Learning (CRL) module is used to produce a combined representation of the text and image features, utilizing a Combined-Representation Identity Alignment (CRIA) loss to align and constrain the identity centers of the three feature vectors. Finally, we design a Semi-Positive Triplet (SPT) loss function, which defines semi-positive samples using other images and texts of the same identity, providing additional supervisory information to the model and further reducing modality heterogeneity. Extensive experiments on the CUHK-PEDES dataset demonstrate that CMSN achieves an impressive Rank-1 and mAP accuracy of 76.46% and 70.28%, respectively, significantly outperforming existing SOTA methods.
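The abstract's Semi-Positive Triplet (SPT) loss can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so the function below is a hypothetical reading: alongside the standard triplet term, a semi-positive sample (another image or caption of the same identity) is pushed closer to the anchor than the negative, but under a relaxed margin (`semi_margin`). All names and margin values are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def semi_positive_triplet_loss(anchor, positive, semi_positive, negative,
                               margin=0.3, semi_margin=0.1):
    """Hypothetical sketch of a semi-positive triplet objective.

    Distances are Euclidean. The semi-positive term supplies extra
    supervision from other same-identity samples, with a smaller margin
    than the true positive's.
    """
    d = lambda a, b: float(np.linalg.norm(a - b))
    # Standard triplet term: positive must be closer than negative by `margin`.
    hard = max(0.0, margin + d(anchor, positive) - d(anchor, negative))
    # Semi-positive term: same structure, relaxed margin `semi_margin`.
    soft = max(0.0, semi_margin + d(anchor, semi_positive) - d(anchor, negative))
    return hard + soft
```

When the negative is far from the anchor, both hinge terms vanish and the loss is zero; as the negative approaches, the semi-positive term activates first because of its smaller margin, which is one plausible way such a loss could reduce modality heterogeneity without forcing semi-positives as close as true positives.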
Journal introduction:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, and surveillance, as well as systems applications that incorporate these features.