{"title":"融合表示、互反学习与自适应损失细化相结合增强基于文本的人物检索。","authors":"Anh D. Nguyen;Hoa N. Nguyen","doi":"10.1109/TIP.2025.3594880","DOIUrl":null,"url":null,"abstract":"Text-based person retrieval is defined as the challenging task of searching for people’s images based on given textual queries in natural language. Conventional methods primarily use deep neural networks to understand the relationship between visual and textual data, creating a shared feature space for cross-modal matching. The absence of awareness regarding variations in feature granularity between the two modalities, coupled with the diverse poses and viewing angles of images corresponding to the same individual, may lead to overlooking significant differences within each modality and across modalities, despite notable enhancements. Furthermore, the inconsistency in caption queries in large public datasets presents an additional obstacle to cross-modality mapping learning. Therefore, we introduce 3RTPR, a novel text-based person retrieval method that integrates a representation fusing mechanism and an adaptive loss refinement algorithm into a dual-encoder branch architecture. Moreover, we propose training two independent models simultaneously, which reciprocally support each other to enhance learning effectiveness. Consequently, our approach encompasses three significant contributions: (i) proposing a fused representation method to generate more discriminative representations for images and captions; (ii) introducing a novel algorithm to adjust loss and prioritize samples that contain valuable information; and (iii) proposing reciprocal learning involving a pair of independent models, which allows us to enhance general retrieval performance. In order to validate our method’s effectiveness, we also demonstrate superior performance over state-of-the-art methods by performing rigorous experiments on three well-known benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5147-5157"},"PeriodicalIF":13.7000,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing Text-Based Person Retrieval by Combining Fused Representation and Reciprocal Learning With Adaptive Loss Refinement\",\"authors\":\"Anh D. Nguyen;Hoa N. Nguyen\",\"doi\":\"10.1109/TIP.2025.3594880\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text-based person retrieval is defined as the challenging task of searching for people’s images based on given textual queries in natural language. Conventional methods primarily use deep neural networks to understand the relationship between visual and textual data, creating a shared feature space for cross-modal matching. The absence of awareness regarding variations in feature granularity between the two modalities, coupled with the diverse poses and viewing angles of images corresponding to the same individual, may lead to overlooking significant differences within each modality and across modalities, despite notable enhancements. Furthermore, the inconsistency in caption queries in large public datasets presents an additional obstacle to cross-modality mapping learning. 
Therefore, we introduce 3RTPR, a novel text-based person retrieval method that integrates a representation fusing mechanism and an adaptive loss refinement algorithm into a dual-encoder branch architecture. Moreover, we propose training two independent models simultaneously, which reciprocally support each other to enhance learning effectiveness. Consequently, our approach encompasses three significant contributions: (i) proposing a fused representation method to generate more discriminative representations for images and captions; (ii) introducing a novel algorithm to adjust loss and prioritize samples that contain valuable information; and (iii) proposing reciprocal learning involving a pair of independent models, which allows us to enhance general retrieval performance. In order to validate our method’s effectiveness, we also demonstrate superior performance over state-of-the-art methods by performing rigorous experiments on three well-known benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"34 \",\"pages\":\"5147-5157\"},\"PeriodicalIF\":13.7000,\"publicationDate\":\"2025-08-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11119813/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11119813/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Enhancing Text-Based Person Retrieval by Combining Fused Representation and Reciprocal Learning With Adaptive Loss Refinement
Text-based person retrieval is the challenging task of searching for images of people that match a given natural-language textual query. Conventional methods primarily use deep neural networks to model the relationship between visual and textual data, creating a shared feature space for cross-modal matching. Despite notable progress, these methods are often unaware of differences in feature granularity between the two modalities and of the diverse poses and viewing angles across images of the same individual, and may therefore overlook significant variations both within and across modalities. Furthermore, inconsistent caption queries in large public datasets present an additional obstacle to learning a cross-modal mapping. We therefore introduce 3RTPR, a novel text-based person retrieval method that integrates a representation-fusing mechanism and an adaptive loss refinement algorithm into a dual-encoder branch architecture. Moreover, we propose training two independent models simultaneously so that they reciprocally support each other and enhance learning effectiveness. Our approach makes three significant contributions: (i) a fused representation method that generates more discriminative representations for images and captions; (ii) a novel algorithm that adjusts the loss to prioritize samples containing valuable information; and (iii) reciprocal learning between a pair of independent models, which improves overall retrieval performance. To validate the method's effectiveness, we conduct rigorous experiments on three well-known benchmarks, CUHK-PEDES, ICFG-PEDES, and RSTPReid, where it outperforms state-of-the-art methods.
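The abstract names the training-time ideas but gives no implementation details, so the following is a minimal, speculative PyTorch sketch of how two of them (reciprocal learning between a pair of independent dual-encoder models, and per-sample adaptive loss weighting) could fit together. Every function, signature, and hyperparameter here (info_nce, reciprocal_step, the clamp range, the temperature) is an illustrative assumption, not the authors' actual design.

```python
# Speculative sketch, not the paper's implementation: two independent
# dual-encoder models trained side by side ("reciprocal learning"), where
# each model re-weights the other's per-sample contrastive loss so that
# pairs its peer finds hard are prioritized ("adaptive loss refinement").
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07, weights=None):
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_sample = 0.5 * (loss_i2t + loss_t2i)
    if weights is not None:                       # adaptive refinement hook
        per_sample = per_sample * weights
    return per_sample.mean(), per_sample.detach()

def reciprocal_step(model_a, model_b, opt_a, opt_b, images, captions):
    """One training step in which each model re-weights the other's loss.

    model_a / model_b are assumed to return (image_embeddings,
    text_embeddings); a plain dual-encoder output stands in for the
    paper's fused representations here.
    """
    ia, ta = model_a(images, captions)
    ib, tb = model_b(images, captions)

    # Peer difficulty scores are detached, so no gradient flows between
    # the two models; samples the peer finds hard get larger weights.
    _, hard_a = info_nce(ia, ta)
    _, hard_b = info_nce(ib, tb)
    w_for_a = (hard_b / hard_b.mean()).clamp(0.5, 2.0)
    w_for_b = (hard_a / hard_a.mean()).clamp(0.5, 2.0)

    loss_a, _ = info_nce(ia, ta, weights=w_for_a)
    loss_b, _ = info_nce(ib, tb, weights=w_for_b)

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```

In a complete pipeline, the embeddings each encoder returns would come from the paper's representation-fusing mechanism, and the weighting rule would follow its adaptive loss refinement algorithm; the normalized-and-clamped scheme above is only one plausible instantiation of "prioritizing samples that contain valuable information."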