{"title":"Learning discriminative features via deep metric learning for video-based person re-identification","authors":"Jiahe Wang, Xizhan Gao, Sijie Niu, Hui Zhao, Guang Feng, Jiaxin Lin","doi":"10.1016/j.eswa.2025.128123","DOIUrl":null,"url":null,"abstract":"<div><div>Video-based person re-identification (Video-Re-ID) is a crucial application in practical scenarios and has gradually emerged as one of the most popular research topics in the field of computer vision. Although many efforts have been made, it still remains a challenge due to substantial variations among person videos and even within each video. For Video-Re-ID, we propose a synchronous intrA-video and intEr-video distance metric learning approach based on temporal ViT architecture, termed as TAE-ViT. The TAE-ViT model, in particular consists of a View Guide Patch Embedding (VGPE) module, a Spatial-Temporal Attention (STA) module and an Intra and Inter-video Distance Metric Learning (IIDML) module. The VGPE module is used to utilize diverse view information and extract discriminative features with perspective invariance. The STA module alternately learns the spatial and temporal information by using the spatial and temporal multi-head attention operation, respectively. The IIDML module simultaneously captures intra-video and inter-video distance metrics from the training videos. Specifically, the intra-video distance metric aims to compact each video representation, while the inter-video distance metric ensures that truly matched videos are closer in distance compared to incorrectly matched ones. Experimental results show that our method achieves the best mAP (86.7 %, 96.3 %, 97.3 %) on three public Video-ReID datasets and achieves the best Rank-1 (93.3 %, 96.6 %) on two datasets, reducing the error of state-of-the-art methods by 1.48-45.16 % on mAP and by 2.86 % on Rank-1. Its robust performance highlights its potential for real-world applications like intelligent surveillance and public safety. Our code will be available at <span><span>https://github.com/JingShenZhuangTaiYiChang/TAE-ViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"286 ","pages":"Article 128123"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425017440","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Video-based person re-identification (Video-Re-ID) is a crucial task in practical scenarios and has gradually become one of the most popular research topics in computer vision. Despite many efforts, it remains challenging due to substantial variations among person videos and even within each video. For Video-Re-ID, we propose a synchronous intrA-video and intEr-video distance metric learning approach based on a Temporal ViT architecture, termed TAE-ViT. Specifically, the TAE-ViT model consists of a View Guide Patch Embedding (VGPE) module, a Spatial-Temporal Attention (STA) module, and an Intra- and Inter-video Distance Metric Learning (IIDML) module. The VGPE module exploits diverse view information to extract discriminative, perspective-invariant features. The STA module alternately learns spatial and temporal information using spatial and temporal multi-head attention, respectively. The IIDML module simultaneously learns intra-video and inter-video distance metrics from the training videos: the intra-video metric makes each video representation compact, while the inter-video metric ensures that truly matched videos are closer than incorrectly matched ones. Experimental results show that our method achieves the best mAP (86.7%, 96.3%, 97.3%) on three public Video-Re-ID datasets and the best Rank-1 accuracy (93.3%, 96.6%) on two of them, reducing the error of state-of-the-art methods by 1.48-45.16% on mAP and by 2.86% on Rank-1. Its robust performance highlights its potential for real-world applications such as intelligent surveillance and public safety. Our code will be available at https://github.com/JingShenZhuangTaiYiChang/TAE-ViT.
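The abstract describes the STA module as alternating spatial multi-head attention (patches attending within each frame) with temporal multi-head attention (each patch position attending across frames). Since the official code is not yet released, the following is only a minimal PyTorch-style sketch of that alternating pattern; the class name, tensor layout, and normalization placement are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Sketch: alternate attention over patches (spatial) and frames (temporal)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # Spatial attention: patches attend to each other within each frame.
        xs = x.reshape(b * t, p, d)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]

        # Temporal attention: each patch position attends across frames.
        xt = xs.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]

        # Restore (batch, frames, patches, dim).
        return xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
```

Factorizing attention this way keeps the cost at roughly O(t·p² + p·t²) rather than the O((t·p)²) of full joint space-time attention, which is a common motivation for alternating designs in video transformers.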
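Likewise, the IIDML module is described as combining an intra-video term that compacts each video representation with an inter-video term that keeps truly matched videos closer than incorrectly matched ones. Below is a hedged sketch of one plausible realization using Euclidean distances and a batch-hard triplet margin; the function name, margin value, and exact formulation are illustrative assumptions based on the abstract, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def iidml_loss(frame_feats: torch.Tensor, video_feats: torch.Tensor,
               labels: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Sketch of a combined intra-video + inter-video metric loss.

    frame_feats: (batch, frames, dim) per-frame embeddings of each video.
    video_feats: (batch, dim) pooled video-level embeddings.
    labels: (batch,) person identity labels.
    """
    # Intra-video term: pull each frame embedding toward its own video
    # representation, making the representation compact.
    intra = (frame_feats - video_feats.unsqueeze(1)).pow(2).sum(-1).mean()

    # Inter-video term: triplet-style margin so that matched videos end up
    # closer than mismatched ones.
    dist = torch.cdist(video_feats, video_feats)            # (batch, batch)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive (farthest same-ID) and hardest negative (closest
    # different-ID) for each anchor in the batch.
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    inter = F.relu(pos - neg + margin).mean()

    return intra + inter
```

Training both terms synchronously, as the abstract emphasizes, lets the frame-level compactness objective and the video-level ranking objective shape the same embedding space in each step.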
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.