{"title":"Soft Supervision Guided Spatial-Temporal Refinement Network For Video-based Visible-Infrared Person Re-Identification.","authors":"Jinxing Li, Chuhao Zhou, Rundong Li, Huafeng Li, Xinyu Lin, Guangming Lu, Yong Xu, David Zhang","doi":"10.1109/TIP.2026.3687081","DOIUrl":null,"url":null,"abstract":"<p><p>Thanks to the automatic switching between visible and infrared modes, 24-hour person re-identification (Re-ID) has become possible through cross-modal retrieval. Instead of exploiting still images, this paper studies video-based cross-modal person Re-ID. Specifically, a large-scale dataset, 'HITSZ-PVCM', is first collected, consisting of 1,681 identities and 839,632 frames. Videos generally contain much richer pedestrian appearance information. However, most existing works generate temporal representations only from whole frames, inevitably losing fine-grained details. Furthermore, training a network with metric losses (e.g., center loss) is a common strategy, but such point-to-point constraints are too strong and limit model generalization because of the diversity among intra-class samples. Here, we propose a Soft Supervision guided Spatial-Temporal Refinement (S<sup>3</sup>TR) network to tackle these problems. Specifically, S<sup>3</sup>TR refines each frame under the guidance of a coarse temporal feature, so that more discriminative features are extracted and transformed into a sequential representation. A global-local mutual learning module then erases the modality gap without losing fine-grained details. Furthermore, we propose a novel soft-clustering center loss that measures intra-/inter-class similarity/dissimilarity in a group-to-group manner, efficiently improving model generalization. To the best of our knowledge, HITSZ-PVCM is the largest dataset of its kind, and S<sup>3</sup>TR achieves superior performance compared with state-of-the-art methods.</p>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"PP ","pages":""},"PeriodicalIF":13.7000,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TIP.2026.3687081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0