Learning discriminative features via deep metric learning for video-based person re-identification

IF 7.5 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jiahe Wang, Xizhan Gao, Sijie Niu, Hui Zhao, Guang Feng, Jiaxin Lin
{"title":"基于深度度量学习的视频人物再识别判别特征学习","authors":"Jiahe Wang,&nbsp;Xizhan Gao,&nbsp;Sijie Niu,&nbsp;Hui Zhao,&nbsp;Guang Feng,&nbsp;Jiaxin Lin","doi":"10.1016/j.eswa.2025.128123","DOIUrl":null,"url":null,"abstract":"<div><div>Video-based person re-identification (Video-Re-ID) is a crucial application in practical scenarios and has gradually emerged as one of the most popular research topics in the field of computer vision. Although many efforts have been made, it still remains a challenge due to substantial variations among person videos and even within each video. For Video-Re-ID, we propose a synchronous intrA-video and intEr-video distance metric learning approach based on temporal ViT architecture, termed as TAE-ViT. The TAE-ViT model, in particular consists of a View Guide Patch Embedding (VGPE) module, a Spatial-Temporal Attention (STA) module and an Intra and Inter-video Distance Metric Learning (IIDML) module. The VGPE module is used to utilize diverse view information and extract discriminative features with perspective invariance. The STA module alternately learns the spatial and temporal information by using the spatial and temporal multi-head attention operation, respectively. The IIDML module simultaneously captures intra-video and inter-video distance metrics from the training videos. Specifically, the intra-video distance metric aims to compact each video representation, while the inter-video distance metric ensures that truly matched videos are closer in distance compared to incorrectly matched ones. Experimental results show that our method achieves the best mAP (86.7 %, 96.3 %, 97.3 %) on three public Video-ReID datasets and achieves the best Rank-1 (93.3 %, 96.6 %) on two datasets, reducing the error of state-of-the-art methods by 1.48-45.16 % on mAP and by 2.86 % on Rank-1. Its robust performance highlights its potential for real-world applications like intelligent surveillance and public safety. Our code will be available at <span><span>https://github.com/JingShenZhuangTaiYiChang/TAE-ViT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"286 ","pages":"Article 128123"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning discriminative features via deep metric learning for video-based person re-identification\",\"authors\":\"Jiahe Wang,&nbsp;Xizhan Gao,&nbsp;Sijie Niu,&nbsp;Hui Zhao,&nbsp;Guang Feng,&nbsp;Jiaxin Lin\",\"doi\":\"10.1016/j.eswa.2025.128123\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Video-based person re-identification (Video-Re-ID) is a crucial application in practical scenarios and has gradually emerged as one of the most popular research topics in the field of computer vision. Although many efforts have been made, it still remains a challenge due to substantial variations among person videos and even within each video. For Video-Re-ID, we propose a synchronous intrA-video and intEr-video distance metric learning approach based on temporal ViT architecture, termed as TAE-ViT. The TAE-ViT model, in particular consists of a View Guide Patch Embedding (VGPE) module, a Spatial-Temporal Attention (STA) module and an Intra and Inter-video Distance Metric Learning (IIDML) module. The VGPE module is used to utilize diverse view information and extract discriminative features with perspective invariance. 
The STA module alternately learns the spatial and temporal information by using the spatial and temporal multi-head attention operation, respectively. The IIDML module simultaneously captures intra-video and inter-video distance metrics from the training videos. Specifically, the intra-video distance metric aims to compact each video representation, while the inter-video distance metric ensures that truly matched videos are closer in distance compared to incorrectly matched ones. Experimental results show that our method achieves the best mAP (86.7 %, 96.3 %, 97.3 %) on three public Video-ReID datasets and achieves the best Rank-1 (93.3 %, 96.6 %) on two datasets, reducing the error of state-of-the-art methods by 1.48-45.16 % on mAP and by 2.86 % on Rank-1. Its robust performance highlights its potential for real-world applications like intelligent surveillance and public safety. Our code will be available at <span><span>https://github.com/JingShenZhuangTaiYiChang/TAE-ViT</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"286 \",\"pages\":\"Article 128123\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417425017440\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425017440","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Video-based person re-identification (Video-Re-ID) is a crucial application in practical scenarios and has gradually emerged as one of the most popular research topics in computer vision. Despite many efforts, it remains challenging due to substantial variations among person videos and even within each video. For Video-Re-ID, we propose a synchronous intrA-video and intEr-video distance metric learning approach based on a temporal ViT architecture, termed TAE-ViT. The TAE-ViT model consists of a View Guide Patch Embedding (VGPE) module, a Spatial-Temporal Attention (STA) module, and an Intra- and Inter-video Distance Metric Learning (IIDML) module. The VGPE module exploits diverse view information to extract discriminative features with perspective invariance. The STA module alternately learns spatial and temporal information using spatial and temporal multi-head attention operations, respectively. The IIDML module simultaneously learns intra-video and inter-video distance metrics from the training videos. Specifically, the intra-video distance metric compacts each video representation, while the inter-video distance metric ensures that truly matched videos are closer in distance than incorrectly matched ones. Experimental results show that our method achieves the best mAP (86.7%, 96.3%, 97.3%) on three public Video-Re-ID datasets and the best Rank-1 accuracy (93.3%, 96.6%) on two of them, reducing the error of state-of-the-art methods by 1.48-45.16% on mAP and by 2.86% on Rank-1. Its robust performance highlights its potential for real-world applications such as intelligent surveillance and public safety. Our code will be available at https://github.com/JingShenZhuangTaiYiChang/TAE-ViT.
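
To make the alternating attention scheme concrete, here is a minimal PyTorch sketch of one spatial-temporal attention block in the spirit of the STA module. The class name, tensor layout, and pre-norm residual structure are assumptions for illustration only; they are not taken from the authors' released code.

```python
import torch
import torch.nn as nn

class STABlock(nn.Module):
    """Hypothetical spatial-temporal attention block: attend over patches
    within each frame, then over time at each patch position. The paper's
    exact STA design may differ from this sketch."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, D) = batch, frames, patches per frame, embed dim
        B, T, N, D = x.shape
        # Spatial attention: fold time into the batch, attend across patches.
        s = x.reshape(B * T, N, D)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]
        x = s.reshape(B, T, N, D)
        # Temporal attention: fold patches into the batch, attend across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)
```

Stacking several such blocks lets spatial and temporal attention alternate, which is how the abstract describes the STA module operating.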
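The abstract describes the IIDML objectives only at a high level, so the sketch below substitutes common stand-ins: a frame-to-centroid compactness term for the intra-video metric and a triplet margin for the inter-video metric. The function name, margin, and weighting factor `alpha` are hypothetical, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

def iidml_loss(frame_feats, anchor, positive, negative, margin=0.3, alpha=0.1):
    """Hypothetical combination of the two IIDML objectives.

    frame_feats: (T, D) per-frame features of one video (intra-video term).
    anchor/positive/negative: (D,) pooled video representations
        (inter-video term).
    """
    # Intra-video metric: pull every frame feature toward the video
    # centroid so each video representation is compact.
    centroid = frame_feats.mean(dim=0, keepdim=True)            # (1, D)
    intra = (frame_feats - centroid).pow(2).sum(dim=1).mean()

    # Inter-video metric: a triplet-style margin keeping truly matched
    # videos closer than incorrectly matched ones.
    d_pos = F.pairwise_distance(anchor.unsqueeze(0), positive.unsqueeze(0))
    d_neg = F.pairwise_distance(anchor.unsqueeze(0), negative.unsqueeze(0))
    inter = F.relu(d_pos - d_neg + margin).mean()

    return alpha * intra + inter
```

In a typical Re-ID pipeline, such a loss would be computed over mini-batches of tracklets and combined with an identification (cross-entropy) loss.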
Source journal
Expert Systems with Applications
Category: Engineering & Technology — Electronic & Electrical Engineering
CiteScore: 13.80
Self-citation rate: 10.60%
Annual output: 2045 articles
Review time: 8.7 months
About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.