{"title":"Perception Assisted Transformer for Unsupervised Object Re-Identification","authors":"Shuoyi Chen;Mang Ye;Xingping Dong;Bo Du","doi":"10.1109/TIP.2025.3553777","DOIUrl":null,"url":null,"abstract":"Unsupervised object re-identification (Re-ID) aims to learn discriminative features without identity annotations. Existing mainstream methods are usually developed based on convolutional neural networks for feature extraction and pseudo-label estimation. However, convolutional neural networks suffer from limitations in capturing dispersed long-range dependencies and integrating global information. In comparison, vision transformers demonstrate superior robustness in complex environments, leveraging their versatile modeling capabilities to process diverse data structures with greater precision. In this paper, we delve into the potential of vision transformers in unsupervised Re-ID, proposing a Transformer-based perception-assisted framework (PAT). Considering Re-ID is a typical fine-grained task, existing unsupervised Re-ID methods relying on pseudo-labels generated by clustering algorithms provide only category-level discriminative supervision, with limited attention to local details. Therefore, we propose a novel target-aware mask alignment (TMA) strategy that provides additional supervision signals by leveraging low-level visual cues. Specifically, we employ pseudo-labels to guide the fine-grained alignment of features with local pixel information from critical discriminative regions. This method establishes a mutual learning mechanism via a shared Transformer, effectively balancing discriminative learning and detailed understanding. Furthermore, we propose a perceptual fusion feature augmentation (PFA) method to optimize instance-level discriminative learning. The proposed method is evaluated on multiple Re-ID datasets, demonstrating superior performance and robustness in comparison to state-of-the-art techniques. Notably, without annotations, our method achieves better results than many supervised counterparts. The code will be released.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2112-2123"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10944266/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Unsupervised object re-identification (Re-ID) aims to learn discriminative features without identity annotations. Existing mainstream methods are typically built on convolutional neural networks for feature extraction and pseudo-label estimation. However, convolutional neural networks are limited in capturing dispersed long-range dependencies and integrating global information. In comparison, vision transformers demonstrate superior robustness in complex environments, leveraging their versatile modeling capabilities to process diverse data structures with greater precision. In this paper, we explore the potential of vision transformers for unsupervised Re-ID and propose a perception-assisted Transformer framework (PAT). Because Re-ID is a typical fine-grained task, existing unsupervised methods that rely on pseudo-labels generated by clustering algorithms provide only category-level discriminative supervision and pay limited attention to local details. We therefore propose a novel target-aware mask alignment (TMA) strategy that provides additional supervision signals by leveraging low-level visual cues. Specifically, we employ pseudo-labels to guide the fine-grained alignment of features with local pixel information from critical discriminative regions. This establishes a mutual learning mechanism via a shared Transformer, effectively balancing discriminative learning with detailed understanding. Furthermore, we propose a perceptual fusion feature augmentation (PFA) method to strengthen instance-level discriminative learning. The proposed method is evaluated on multiple Re-ID datasets and demonstrates superior performance and robustness compared with state-of-the-art techniques. Notably, without any annotations, our method outperforms many supervised counterparts. The code will be released.
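For context, the sketch below illustrates the clustering-based pseudo-label step the abstract refers to. It is a generic simplification of standard unsupervised Re-ID pipelines, not the authors' implementation: the random feature matrix, the DBSCAN hyperparameters (eps, min_samples), and the plain cosine distance are all assumptions for illustration (full pipelines often use a k-reciprocal Jaccard distance instead).

    # Sketch (assumed, not from the paper): DBSCAN over L2-normalized
    # embeddings produces category-level pseudo-labels for the next
    # training epoch. Feature extraction is stubbed with random vectors
    # so the snippet runs standalone.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    features = rng.normal(size=(512, 768)).astype(np.float32)  # e.g. ViT [CLS] embeddings
    features /= np.linalg.norm(features, axis=1, keepdims=True)  # L2-normalize

    # Cosine distance matrix (clipped to avoid tiny negative values from
    # floating-point error); DBSCAN groups presumed same-identity images.
    dist = np.clip(1.0 - features @ features.T, 0.0, None)
    labels = DBSCAN(eps=0.6, min_samples=4, metric="precomputed").fit_predict(dist)

    # Points labeled -1 are un-clustered outliers; they are usually dropped
    # (or kept as singleton instances) for the next training epoch.
    pseudo_labels = {i: l for i, l in enumerate(labels) if l != -1}
    print(f"{len(set(pseudo_labels.values()))} clusters, "
          f"{np.sum(labels == -1)} outliers discarded")

The labels produced this way are exactly the coarse, category-level supervision the abstract argues is insufficient on its own, which is what motivates supplementing them with the pixel-level cues of TMA.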