Semantic-Aware Auto-Encoders for Self-supervised Representation Learning

Guangrun Wang, Yansong Tang, Liang Lin, Philip H. S. Torr
DOI: 10.1109/CVPR52688.2022.00944
Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022
Cited by: 7

Abstract

The resurgence of unsupervised learning can be attributed to the remarkable progress of self-supervised learning, which includes generative ($\mathcal{G}$) and discriminative ($\mathcal{D}$) models. In computer vision, the mainstream self-supervised learning algorithms are $\mathcal{D}$ models. However, designing a $\mathcal{D}$ model can be over-complicated; moreover, some studies have hinted that a $\mathcal{D}$ model may not be as general and interpretable as a $\mathcal{G}$ model. In this paper, we switch from $\mathcal{D}$ models to $\mathcal{G}$ models using the classical auto-encoder (AE). Note that a vanilla $\mathcal{G}$ model is far less efficient than a $\mathcal{D}$ model in self-supervised computer vision tasks, as it wastes model capacity on overfitting semantic-agnostic high-frequency details. Inspired by perceptual learning, which can use cross-view learning to perceive concepts and semantics¹, we propose a novel AE that learns semantic-aware representations via cross-view image reconstruction. We use one view of an image as the input and another view of the same image as the reconstruction target. This kind of AE has rarely been studied before, and its optimization is very difficult. To enhance the learning ability and find a feasible solution, we propose a semantic aligner that uses geometric-transformation knowledge to align the hidden code of the AE and ease optimization. These techniques significantly improve the representation learning ability of the AE and make self-supervised learning with $\mathcal{G}$ models possible. Extensive experiments on many large-scale benchmarks (e.g., ImageNet, COCO 2017, and SYSU-30k) demonstrate the effectiveness of our methods. Code is available at https://github.com/wanggrun/Semantic-Aware-AE.

¹ Following [26], we refer to semantics as visual concepts; e.g., a semantic-aware model can perceive visual concepts, and the learned features are effective for object recognition, detection, etc.
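The cross-view objective described above — encode one view of an image and reconstruct a *different* view of the same image — can be sketched with a deliberately tiny linear model. This is a hypothetical illustration only: the paper uses deep networks and a semantic aligner driven by geometric transformations, and here the two "views" are just noisy copies of one flattened image. All names (`LinearAE`, `step`) are made up for the sketch.

```python
import numpy as np

class LinearAE:
    """Toy linear auto-encoder trained with a cross-view objective:
    encode view A, decode, and penalize the reconstruction error
    against view B of the same image. (Hypothetical sketch; the paper's
    deep networks and semantic aligner are omitted here.)"""

    def __init__(self, dim, code_dim, rng):
        self.We = 0.1 * rng.standard_normal((code_dim, dim))  # encoder
        self.Wd = 0.1 * rng.standard_normal((dim, code_dim))  # decoder

    def step(self, x_in, x_target, lr=1e-2):
        # Forward pass: hidden code and reconstruction.
        h = self.We @ x_in
        recon = self.Wd @ h
        err = recon - x_target  # gradient of 0.5*||recon - x_target||^2
        # Gradients computed before either matrix is updated.
        grad_Wd = np.outer(err, h)
        grad_We = np.outer(self.Wd.T @ err, x_in)
        self.Wd -= lr * grad_Wd
        self.We -= lr * grad_We
        return float((err ** 2).mean())

rng = np.random.default_rng(0)
image = rng.standard_normal(16)  # one "image" as a flat vector
ae = LinearAE(dim=16, code_dim=4, rng=rng)

losses = []
for _ in range(300):
    # Two noisy copies stand in for the paper's augmented views.
    view_a = image + 0.05 * rng.standard_normal(16)
    view_b = image + 0.05 * rng.standard_normal(16)
    losses.append(ae.step(view_a, view_b))  # reconstruct B from A's code
```

Because the target is a different view than the input, the model cannot win by memorizing the input's pixel noise; it is pushed toward what the views share, which is the intuition behind the semantic-aware representation.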