Soft Supervision Guided Spatial-Temporal Refinement Network For Video-based Visible-Infrared Person Re-Identification.

Impact Factor: 13.7
Jinxing Li, Chuhao Zhou, Rundong Li, Huafeng Li, Xinyu Lin, Guangming Lu, Yong Xu, David Zhang
Journal: IEEE Transactions on Image Processing
DOI: 10.1109/TIP.2026.3687081
Publication date: 2026-04-29
Citations: 0

Abstract

Thanks to cameras that automatically switch between visible and infrared modes, 24-hour person re-identification (Re-ID) has become possible through cross-modal retrieval. Instead of exploiting still images, this paper studies video-based cross-modal person Re-ID. Specifically, a large-scale dataset, 'HITSZ-PVCM', is first collected, consisting of as many as 1,681 identities and 839,632 frames. Videos generally contain much richer pedestrian appearances than still images. However, most existing works generate temporal representations only from whole frames, inevitably losing fine-grained details. Furthermore, training a network with metric losses (e.g., the center loss) is a common strategy, yet such point-to-point constraints are too strong and, given the diversity among intra-class samples, limit model generalization. Here, we propose a Soft Supervision guided Spatial-Temporal Refinement (S3TR) network to tackle these problems. Specifically, S3TR refines each frame under the guidance of a coarse temporal feature, so that more discriminative features are extracted and transformed into a sequential representation. A global-local mutual learning module then erases the modality gap without losing fine-grained details. Furthermore, we propose a novel soft-clustering center loss that measures intra-/inter-class similarity/dissimilarity in a group-to-group manner, efficiently improving model generalization. To the best of our knowledge, HITSZ-PVCM is the largest dataset for this task, and S3TR achieves superior performance compared with state-of-the-art methods.
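The contrast between the point-to-point center loss criticized above and a group-to-group constraint can be illustrated with a minimal sketch. This is not the paper's exact soft-clustering formulation (which is not given in the abstract); the function names, the use of class-level spread statistics, and the margin-based inter-class term are assumptions for illustration only:

```python
import numpy as np

def center_loss(features, labels):
    # Classic point-to-point center loss: pull every sample onto its
    # class mean, penalizing any intra-class diversity.
    loss = 0.0
    for c in np.unique(labels):
        group = features[labels == c]
        center = group.mean(axis=0)
        loss += np.sum((group - center) ** 2)
    return loss / len(features)

def group_to_group_loss(features, labels, margin=1.0):
    # Hypothetical group-to-group constraint: compare class-level
    # statistics instead of individual points, so some intra-class
    # spread is tolerated while class centers stay separated.
    classes = np.unique(labels)
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Intra-class term: average per-class spread (soft, not forced to zero).
    intra = np.mean([features[labels == c].std(axis=0).mean() for c in classes])
    # Inter-class term: hinge penalty on pairwise center distances.
    inter, n_pairs = 0.0, 0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = np.linalg.norm(centers[i] - centers[j])
            inter += max(0.0, margin - d)
            n_pairs += 1
    return intra + inter / max(n_pairs, 1)
```

On two well-separated classes, the group-to-group variant penalizes only the average spread, whereas the center loss charges every sample for its full distance to the class mean.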
