{"title":"Soft Supervision Guided Spatial-Temporal Refinement Network For Video-based Visible-Infrared Person Re-Identification.","authors":"Jinxing Li, Chuhao Zhou, Rundong Li, Huafeng Li, Xinyu Lin, Guangming Lu, Yong Xu, David Zhang","doi":"10.1109/TIP.2026.3687081","DOIUrl":null,"url":null,"abstract":"<p><p>Thanks to the automatic switching between visible and infrared modes, 24-hour person re-identification (Re-ID) has become possible through cross-modal retrieval. Instead of exploiting still images, this paper studies video-based cross-modal person Re-ID. Specifically, a large-scale dataset, 'HITSZ-PVCM', is first collected, consisting of 1,681 identities and 839,632 frames. Videos generally contain much richer pedestrian appearance information. However, most existing works generate temporal representations only from whole frames, inevitably losing fine-grained details. Furthermore, training a network with metric losses (e.g., center loss) is a common strategy, but such point-to-point constraints are too strong and limit model generalization because of the diversity among intra-class samples. Here, we propose a Soft Supervision guided Spatial-Temporal Refinement (S<sup>3</sup>TR) network to tackle these problems. Specifically, S<sup>3</sup>TR refines each frame under the guidance of a coarse temporal feature, so that more discriminative features are extracted and transformed into a sequential representation. A global-local mutual learning module then erases the modality gap without losing fine-grained details. Furthermore, we propose a novel soft-clustering center loss that measures intra-/inter-class similarity/dissimilarity in a group-to-group manner, efficiently improving model generalization. To the best of our knowledge, HITSZ-PVCM is the largest dataset of its kind, and S<sup>3</sup>TR achieves superior performance compared with state-of-the-art methods.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TIP.2026.3687081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0