CLIP-Based Modality Compensation for Visible-Infrared Image Re-Identification

IF 8.4 · CAS Zone 1 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Gang Hu;Yafei Lv;Jianting Zhang;Qian Wu;Zaidao Wen
{"title":"基于clip的可见-红外图像再识别模态补偿","authors":"Gang Hu;Yafei Lv;Jianting Zhang;Qian Wu;Zaidao Wen","doi":"10.1109/TMM.2024.3521764","DOIUrl":null,"url":null,"abstract":"Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignore modality-specific features, which often also contain crucial important discriminative information. In addition, high-level semantic information of the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance the retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible compared to fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches to demonstrate its effectiveness and superiority.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2112-2126"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIP-Based Modality Compensation for Visible-Infrared Image Re-Identification\",\"authors\":\"Gang Hu;Yafei Lv;Jianting Zhang;Qian Wu;Zaidao Wen\",\"doi\":\"10.1109/TMM.2024.3521764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignore modality-specific features, which often also contain crucial important discriminative information. In addition, high-level semantic information of the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance the retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. 
CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible compared to fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches to demonstrate its effectiveness and superiority.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2112-2126\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-12-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10814673/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814673/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignoring modality-specific features, which often also contain crucial discriminative information. In addition, high-level semantic information of the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible than fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches, demonstrating its effectiveness and superiority.
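Two of the mechanisms the abstract describes are easy to picture in code: the MCL module learns a separate bank of context tokens per modality (in the spirit of CoOp-style soft prompt learning), and the MC module fuses the original image feature with the compensated feature obtained under that context. The PyTorch sketch below is our own illustration, not the authors' implementation; the class names, tensor shapes, the concatenation-plus-linear fusion, and the stand-in for CLIP's text encoder are all assumptions.

```python
# Minimal sketch of the CLIP-MC idea (illustrative only, not the paper's code).
# Assumed: 512-d CLIP features, 2 modalities (0 = visible, 1 = infrared).
import torch
import torch.nn as nn

class ModalityContextLearner(nn.Module):
    """MCL sketch: one bank of learnable context tokens per modality."""
    def __init__(self, n_modalities: int = 2, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        # Learnable prompt context, analogous to CoOp-style soft prompts.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_modalities, n_ctx, dim))

    def forward(self, modality: torch.Tensor) -> torch.Tensor:
        # modality: (B,) tensor of modality ids -> (B, n_ctx, dim) context tokens
        return self.ctx[modality]

class ModalityCompensation(nn.Module):
    """MC sketch: fuse original and compensated features into a complete feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion; the paper may differ

    def forward(self, img_feat: torch.Tensor, comp_feat: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([img_feat, comp_feat], dim=-1))

# Toy forward pass. In CLIP-MC the compensated feature would come from CLIP's
# text encoder conditioned on the context tokens; here we average the tokens
# as a placeholder for that encoder.
mcl, mc = ModalityContextLearner(), ModalityCompensation()
img_feat = torch.randn(8, 512)                # from a CLIP image encoder
ctx = mcl(torch.randint(0, 2, (8,)))          # modality-specific context tokens
comp_feat = ctx.mean(dim=1)                   # placeholder for text-encoder output
complete_feat = mc(img_feat, comp_feat)       # (8, 512) complete modality feature
```

Concatenation followed by a linear layer is simply the most minimal reading of "combines the original and compensated modality features"; the paper's actual fusion design may differ.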
Source Journal
IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.