Semantic consistency learning for unsupervised multi-modal person re-identification

IF 4.2 | Zone 3 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Yuxin Zhang, Zhu Teng, Baopeng Zhang
Image and Vision Computing, volume 155, Article 105434. Journal Article, published 2025-02-13. DOI: 10.1016/j.imavis.2025.105434. https://www.sciencedirect.com/science/article/pii/S0262885625000228
Citations: 0

Abstract

Unsupervised multi-modal person re-identification is challenging because of the substantial modality gap and the absence of annotations. Previous efforts have aimed to bridge this gap by establishing modality correspondences, but they have been confined to feature-level and image-level correspondences and do not fully exploit semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. We also design a cross-modality loss that uses contrastive learning to acquire modality-invariant features, reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. In addition to the visible and infrared modalities, this dataset includes a depth modality; it was captured by three cameras and covers 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments on the widely used person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.
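
The abstract does not give the form of the cross-modality loss. Purely as an illustration of the general idea (contrastive learning driven by pseudo-labels shared across modalities), the sketch below implements a generic InfoNCE-style loss in NumPy: each feature from one modality is pulled toward the other modality's cluster centroid that carries the same pseudo-label. The function names, the use of centroids, and the temperature of 0.05 are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cluster_centroids(features, labels, num_clusters):
    """Mean feature per pseudo-label, L2-normalized.

    Assumes every cluster index in range(num_clusters) has at least one member.
    """
    centroids = np.zeros((num_clusters, features.shape[1]))
    for k in range(num_clusters):
        centroids[k] = features[labels == k].mean(axis=0)
    return l2_normalize(centroids)


def cross_modality_contrastive_loss(queries, query_labels, other_centroids,
                                    temperature=0.05):
    """Generic InfoNCE loss (hypothetical stand-in, not the SCLNet loss).

    queries:         (N, D) features from one modality, e.g. visible
    query_labels:    (N,)   integer pseudo-labels of the queries
    other_centroids: (K, D) unit-norm centroids from the other modality
    """
    logits = l2_normalize(queries) @ other_centroids.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of the centroid with the matching pseudo-label.
    return -log_prob[np.arange(len(queries)), query_labels].mean()
```

In this sketch the loss is low when features from the two modalities that share a pseudo-label lie close together and rises when they do not, which is the sense in which such a loss encourages modality-invariant features.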
Source journal
Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 8.50
Self-citation rate: 8.50%
Publication volume: 143
Review time: 7.8 months
Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.