{"title":"Perception Assisted Transformer for Unsupervised Object Re-Identification","authors":"Shuoyi Chen;Mang Ye;Xingping Dong;Bo Du","doi":"10.1109/TIP.2025.3553777","DOIUrl":null,"url":null,"abstract":"Unsupervised object re-identification (Re-ID) aims to learn discriminative features without identity annotations. Existing mainstream methods are usually developed based on convolutional neural networks for feature extraction and pseudo-label estimation. However, convolutional neural networks suffer from limitations in capturing dispersed long-range dependencies and integrating global information. In comparison, vision transformers demonstrate superior robustness in complex environments, leveraging their versatile modeling capabilities to process diverse data structures with greater precision. In this paper, we delve into the potential of vision transformers in unsupervised Re-ID, proposing a Transformer-based perception-assisted framework (PAT). Considering Re-ID is a typical fine-grained task, existing unsupervised Re-ID methods relying on pseudo-labels generated by clustering algorithms provide only category-level discriminative supervision, with limited attention to local details. Therefore, we propose a novel target-aware mask alignment (TMA) strategy that provides additional supervision signals by leveraging low-level visual cues. Specifically, we employ pseudo-labels to guide the fine-grained alignment of features with local pixel information from critical discriminative regions. This method establishes a mutual learning mechanism via a shared Transformer, effectively balancing discriminative learning and detailed understanding. Furthermore, we propose a perceptual fusion feature augmentation (PFA) method to optimize instance-level discriminative learning. The proposed method is evaluated on multiple Re-ID datasets, demonstrating superior performance and robustness in comparison to state-of-the-art techniques. Notably, without annotations, our method achieves better results than many supervised counterparts. The code will be released.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2112-2123"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10944266/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Unsupervised object re-identification (Re-ID) aims to learn discriminative features without identity annotations. Existing mainstream methods are typically built on convolutional neural networks for feature extraction and pseudo-label estimation. However, convolutional neural networks are limited in capturing dispersed long-range dependencies and integrating global information. In comparison, vision transformers demonstrate superior robustness in complex environments, leveraging their versatile modeling capabilities to process diverse data structures with greater precision. In this paper, we explore the potential of vision transformers for unsupervised Re-ID and propose a perception-assisted Transformer framework (PAT). Because Re-ID is a typical fine-grained task, existing unsupervised methods that rely on pseudo-labels generated by clustering algorithms provide only category-level discriminative supervision and pay limited attention to local details. We therefore propose a novel target-aware mask alignment (TMA) strategy that provides additional supervision signals by leveraging low-level visual cues. Specifically, we employ pseudo-labels to guide the fine-grained alignment of features with local pixel information from critical discriminative regions. This establishes a mutual learning mechanism via a shared Transformer, effectively balancing discriminative learning with detailed understanding. Furthermore, we propose a perceptual fusion feature augmentation (PFA) method to strengthen instance-level discriminative learning. The proposed method is evaluated on multiple Re-ID datasets and demonstrates superior performance and robustness compared with state-of-the-art techniques. Notably, without any annotations, our method outperforms many supervised counterparts. The code will be released.
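For context, the sketch below illustrates the clustering-based pseudo-label step the abstract refers to. It is a generic simplification of standard unsupervised Re-ID pipelines, not the authors' implementation: the random feature matrix, the DBSCAN hyperparameters (eps, min_samples), and the plain cosine distance are all assumptions for illustration (full pipelines often use a k-reciprocal Jaccard distance instead).

    # Sketch (assumed, not from the paper): DBSCAN over L2-normalized
    # embeddings produces category-level pseudo-labels for the next
    # training epoch. Feature extraction is stubbed with random vectors
    # so the snippet runs standalone.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    features = rng.normal(size=(512, 768)).astype(np.float32)  # e.g. ViT [CLS] embeddings
    features /= np.linalg.norm(features, axis=1, keepdims=True)  # L2-normalize

    # Cosine distance matrix (clipped to avoid tiny negative values from
    # floating-point error); DBSCAN groups presumed same-identity images.
    dist = np.clip(1.0 - features @ features.T, 0.0, None)
    labels = DBSCAN(eps=0.6, min_samples=4, metric="precomputed").fit_predict(dist)

    # Points labeled -1 are un-clustered outliers; they are usually dropped
    # (or kept as singleton instances) for the next training epoch.
    pseudo_labels = {i: l for i, l in enumerate(labels) if l != -1}
    print(f"{len(set(pseudo_labels.values()))} clusters, "
          f"{np.sum(labels == -1)} outliers discarded")

The labels produced this way are exactly the coarse, category-level supervision the abstract argues is insufficient on its own, which is what motivates supplementing them with the pixel-level cues of TMA.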