Spatio-temporal Feature-level Augmentation Vision Transformer for video-based person re-identification

Minjung Kim, MyeongAh Cho, Heansung Lee, Sangyoun Lee

Pattern Recognition, Volume 168, Article 111813 (Q1, Computer Science, Artificial Intelligence; IF 7.5). DOI: 10.1016/j.patcog.2025.111813. Published online 2025-05-24. Available at: https://www.sciencedirect.com/science/article/pii/S003132032500473X
Citations: 0
Abstract
Video-based person re-identification (ReID) aims to match an individual across multiple videos, a task central to security applications of computer vision. While previous transformer-based approaches have used various means to enhance performance, the increasing complexity of their network designs makes it difficult to meet the practical requirements of intelligent surveillance systems. To improve network efficiency, we introduce a Feature-level Augmentation Vision Transformer (FAViT), which reinterprets the attributes of video ReID. We leverage the property that identity is preserved even when backgrounds change or multiple persons appear in video frames. First, we introduce Token Representation Learning to distinguish foreground from background. We then apply spatio-temporal feature-level augmentation, together with Altered Background ID classification and Anomaly Frame Detection, to strengthen the representation capacity of the transformer. Extensive experiments across five benchmarks validate the effectiveness of FAViT, which incurs the least computational overhead among transformer-based models. We further substantiate our model's generalization ability through analyses.
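The core idea of feature-level background augmentation as described above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: it assumes patch tokens have already been split into foreground and background by a token-level mask (the paper's Token Representation Learning), and it swaps background tokens between samples in a batch, returning a donor index that could serve as an "Altered Background ID" classification target. Function and variable names are hypothetical.

```python
import numpy as np

def augment_background_tokens(tokens, fg_mask, seed=None):
    """Hypothetical sketch of feature-level background augmentation.

    tokens:  (B, N, D) patch-token features for a batch of frames.
    fg_mask: (B, N) boolean mask, True where a token is foreground (person).
    Returns (augmented_tokens, donor): each sample's background tokens are
    replaced by those of another sample ("donor"); the donor index can be
    used as the label for an altered-background classification head.
    """
    rng = np.random.default_rng(seed)
    B = tokens.shape[0]
    # Shift every index by a random non-zero offset so no sample donates
    # its own background back to itself.
    donor = (np.arange(B) + rng.integers(1, B)) % B
    out = tokens.copy()
    bg = ~fg_mask  # background token positions
    for i in range(B):
        out[i, bg[i]] = tokens[donor[i], bg[i]]
    return out, donor
```

Because identity is carried by the foreground tokens, the augmented sequence keeps its original person ID while its background changes, which is exactly the invariance the abstract exploits.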
About the journal
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.