CLIP-Based Modality Compensation for Visible-Infrared Image Re-Identification
Gang Hu; Yafei Lv; Jianting Zhang; Qian Wu; Zaidao Wen
DOI: 10.1109/TMM.2024.3521764
IEEE Transactions on Multimedia, vol. 27, pp. 2112-2126
Published: 2024-12-25
URL: https://ieeexplore.ieee.org/document/10814673/
Citations: 0
Abstract
Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignoring modality-specific features, which often also contain crucial discriminative information. In addition, high-level semantic information about the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible than fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches to demonstrate its effectiveness and superiority.
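The abstract describes the CLIP-MC pipeline only at a high level. As a rough illustration of the modality-compensation idea (not the authors' implementation), the sketch below shows learnable modality-specific context tokens prepended to image-derived tokens, passed through a CLIP-style text encoder, and fused with the original image features. All class names, dimensions, and the fusion operator are assumptions made for illustration only.

```python
# Hedged sketch of the modality-compensation idea from the abstract.
# The actual CLIP-MC design (token counts, encoders, fusion scheme) is not
# specified here, so every detail below is an assumption.
import torch
import torch.nn as nn


class ModalityCompensation(nn.Module):
    """Fuse original image features with text-encoder features conditioned on
    learnable, modality-specific context tokens (loosely: MCL + MC)."""

    def __init__(self, text_encoder: nn.Module, embed_dim: int = 512,
                 num_ctx_tokens: int = 4):
        super().__init__()
        self.text_encoder = text_encoder  # assumed: CLIP-style text encoder
        # One learnable context per modality (visible / infrared) -- assumption.
        self.ctx_visible = nn.Parameter(torch.randn(num_ctx_tokens, embed_dim) * 0.02)
        self.ctx_infrared = nn.Parameter(torch.randn(num_ctx_tokens, embed_dim) * 0.02)
        # Fusion is assumed to be a linear projection over concatenated features.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, img_feat: torch.Tensor, img_tokens: torch.Tensor,
                modality: str) -> torch.Tensor:
        ctx = self.ctx_visible if modality == "visible" else self.ctx_infrared
        ctx = ctx.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        # Prompt = modality-specific context tokens + image-derived tokens.
        prompt = torch.cat([ctx, img_tokens], dim=1)
        compensated = self.text_encoder(prompt)  # assumed output: (B, embed_dim)
        # Merge original and compensated modality features.
        return self.fuse(torch.cat([img_feat, compensated], dim=-1))


class _MeanPoolEncoder(nn.Module):
    """Dummy stand-in for a text encoder: mean-pools the token sequence."""
    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return tokens.mean(dim=1)


if __name__ == "__main__":
    mc = ModalityCompensation(_MeanPoolEncoder(), embed_dim=512)
    img_feat = torch.randn(2, 512)        # pooled image features (hypothetical)
    img_tokens = torch.randn(2, 8, 512)   # image tokens used as prompts (hypothetical)
    out = mc(img_feat, img_tokens, modality="infrared")
    print(out.shape)  # torch.Size([2, 512])
```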
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.