Unsupervised Part Discovery via Dual Representation Alignment

Jiahao Xia, Wenjian Huang, Min Xu, Jianguo Zhang, Haimin Zhang, Ziyu Sheng, Dong Xu
{"title":"Unsupervised Part Discovery via Dual Representation Alignment.","authors":"Jiahao Xia, Wenjian Huang, Min Xu, Jianguo Zhang, Haimin Zhang, Ziyu Sheng, Dong Xu","doi":"10.1109/TPAMI.2024.3445582","DOIUrl":null,"url":null,"abstract":"<p><p>Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released upon paper acceptance.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3445582","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning has still not received as much attention as other vision tasks. Previous research has established that Vision Transformers can learn instance-level attention without labels, extracting high-quality instance-level representations that boost downstream tasks. In this paper, we achieve unsupervised part-specific attention learning through a novel paradigm and further employ the resulting part representations to improve part discovery performance. Specifically, paired images are generated from the same image under different geometric transformations, and multiple part representations are extracted from these paired images by a novel module named PartFormer. The part representations from the paired images are then exchanged to improve invariance to geometric transformations. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, so that each part representation attains high similarity with the pixel representations of its corresponding part region and low similarity in irrelevant regions. Finally, geometric and semantic constraints are applied to the part representations through the intermediate results of the alignment for part-specific attention learning, encouraging the PartFormer to attend locally and the part representations to explicitly encode the information of their corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments on four widely used datasets demonstrate that the proposed method achieves competitive performance and robustness thanks to its part-specific attention. The code will be released upon paper acceptance.
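
The abstract gives no implementation details, but the final step, using the aligned part representations as detectors that predict pixel masks, can be illustrated with a minimal sketch. In the PyTorch snippet below, every name, tensor shape, and the choice of cosine similarity are assumptions made for illustration, not the authors' released code:

```python
# Minimal sketch (assumptions, not the authors' implementation):
# each aligned part representation is matched against every pixel
# embedding of the feature map, and each pixel is assigned to the
# part it is most similar to, yielding a per-pixel part mask.
import torch
import torch.nn.functional as F

def predict_part_masks(part_reprs: torch.Tensor,
                       feature_map: torch.Tensor) -> torch.Tensor:
    """part_reprs: (K, D) aligned part representations.
    feature_map: (D, H, W) pixel embeddings from the feature map encoder.
    Returns an (H, W) map assigning each pixel a part index in [0, K).
    """
    K, D = part_reprs.shape
    _, H, W = feature_map.shape
    pixels = feature_map.reshape(D, H * W)                              # (D, HW)
    # Cosine similarity between every part and every pixel embedding.
    sim = F.normalize(part_reprs, dim=1) @ F.normalize(pixels, dim=0)  # (K, HW)
    return sim.argmax(dim=0).reshape(H, W)            # hard per-pixel assignment
```

A hard argmax is the simplest reading of "predicting pixel masks"; the paper's actual procedure may instead use a softmax over parts, a background class, or a similarity threshold.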
