Semantic consistency learning for unsupervised multi-modal person re-identification

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-02-13 DOI:10.1016/j.imavis.2025.105434

Yuxin Zhang, Zhu Teng, Baopeng Zhang

{"title":"Semantic consistency learning for unsupervised multi-modal person re-identification","authors":"Yuxin Zhang, Zhu Teng, Baopeng Zhang","doi":"10.1016/j.imavis.2025.105434","DOIUrl":null,"url":null,"abstract":"<div><div>Unsupervised multi-modal person re-identification poses significant challenges due to the substantial modality gap and the absence of annotations. Although previous efforts have aimed to bridge this gap by establishing modality correspondences, their focus has been confined to the feature and image level correspondences, neglecting full utilization of semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. Besides, we also design a cross-modality loss that utilizes contrastive learning to acquire modality-invariant features, effectively reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. This dataset not only encompasses visible and infrared modalities but also includes a depth modality, captured by three cameras across 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments, utilizing the widely employed person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed multi-modal Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"155 ","pages":"Article 105434"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000228","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Unsupervised multi-modal person re-identification poses significant challenges due to the substantial modality gap and the absence of annotations. Although previous efforts have aimed to bridge this gap by establishing modality correspondences, their focus has been confined to the feature and image level correspondences, neglecting full utilization of semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. Besides, we also design a cross-modality loss that utilizes contrastive learning to acquire modality-invariant features, effectively reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. This dataset not only encompasses visible and infrared modalities but also includes a depth modality, captured by three cameras across 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments, utilizing the widely employed person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed multi-modal Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.

查看原文本刊更多论文

无监督多模态人再识别的语义一致性学习

由于存在大量的模态差距和缺少注释，无监督的多模态人再识别提出了重大挑战。虽然以前的研究都试图通过建立情态对应来弥补这一差距，但他们的重点一直局限于特征和图像级的对应，而忽略了对语义信息的充分利用。为了解决这些问题，我们提出了一个用于无监督多模态人再识别的语义一致性学习网络（SCLNet）。SCLNet首先使用分层聚类算法预测伪标签，该算法利用公共语义跨模态执行相互细化，并基于语义分析建立跨模态标签对应。此外，我们还设计了一个跨模态损失，利用对比学习来获取模态不变特征，有效地减少了模态间的差距，增强了模型的鲁棒性。此外，我们构建了一个新的多模态数据集，命名为Subway-TM。该数据集不仅包括可见光和红外模式，还包括深度模式，由266个身份的三台相机捕获，包括10,645个RGB图像，10,529个红外图像和10,529个深度图像。据我们所知，这是第一个具有三种模式的人再识别数据集。我们进行了广泛的实验，利用广泛使用的人员再识别数据集SYSU-MM01和RegDB，以及我们新提出的多模式地铁tm数据集。实验结果表明，与目前最先进的方法相比，我们提出的方法是有希望的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.