ViLNM: Visual-Language Noise Modeling for Text-to-Image Person Retrieval
Authors: Guolin Xu, Yong Feng, Yanying Chen, Guofan Duan, Mingliang Zhou
Journal: IEEE Signal Processing Letters, vol. 32, pp. 1386-1390
DOI: 10.1109/LSP.2025.3553424
Published: 2025-03-20
URL: https://ieeexplore.ieee.org/document/10935662/
Citations: 0
Abstract
Text-to-image person retrieval (TPR) aims to find a specific person from a textual description, and most methods implicitly assume that the training image-text pairs are correctly aligned. In practice, image-text pairs may be weakly or falsely correlated because of low image quality and annotation errors. Meanwhile, strong visual similarity between different person identities can cause mismatches between text and image. To tackle these two issues, we present a Visual-Language Noise Modeling (ViLNM) method that captures robust cross-modal associations even under noise. Specifically, we design a Noise Token Aware (NTA) module that discards words in the textual description that do not match the image and uses the matched words to establish a more reliable association. In addition, to strengthen the model's ability to distinguish different person identities, we propose a Joint Inter- and Intra-Modal Contrastive Loss (JII) and a Local Aggregation (LA) module that enlarge the feature differences between identities. Comprehensive experiments on three public benchmarks show that ViLNM achieves the best performance.
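The two core ideas in the abstract, filtering text tokens that do not match the image and pulling matched image-text pairs together with a contrastive loss, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the paper's NTA module is learned, whereas here a fixed cosine-similarity threshold `tau` stands in for it, and the loss below is a generic InfoNCE-style inter-modal term rather than the full JII loss. All function names and parameters are hypothetical.

```python
import numpy as np

def filter_noise_tokens(token_feats, image_feat, tau=0.2):
    """Sketch of noise-token filtering: keep only text tokens whose cosine
    similarity to the global image feature exceeds a threshold tau.
    (A stand-in for the learned NTA module; tau is an assumption.)"""
    img = image_feat / np.linalg.norm(image_feat)
    toks = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    sims = toks @ img                    # per-token similarity to the image
    keep = sims > tau                    # mask of "matched" tokens
    return token_feats[keep], keep

def inter_modal_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """InfoNCE-style inter-modal loss over a batch: the matched image-text
    pairs sit on the diagonal of the similarity matrix and are treated as
    positives; all other pairs in the batch are negatives."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))   # cross-entropy on the diagonal
```

With correctly aligned batches the diagonal dominates and the loss is small; shuffling the text side (a simulated noisy correspondence) raises it, which is what makes the contrastive term sensitive to mismatched pairs in the first place.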
About the journal:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, as well as at several workshops organized by the Signal Processing Society.