Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Pub Date: 2023-06-01, DOI: 10.1109/CVPR52729.2023.01880

Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray


Citations: 5

Abstract

We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with “mixed” supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5^i and COCO-20^i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
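The abstract does not spell out how attention maps become pseudo-labels, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single query token (for example a class token) attends over patch tokens with scaled dot-product attention, and that the resulting map is min-max normalized and thresholded into a binary mask. The function names `attention_map` and `pseudo_mask`, the threshold of 0.5, and the toy token values are all hypothetical.

```python
import math


def attention_map(query, patch_tokens):
    """Scaled dot-product attention of one query token over patch tokens.

    Assumption: a single query vector stands in for whatever token the
    real model uses to attend over the image patches.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
        for token in patch_tokens
    ]
    # Numerically stable softmax over the patch positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_mask(attn, height, width, threshold=0.5):
    """Turn a flat attention map into a binary height x width pseudo-label.

    Assumption: min-max normalization followed by a fixed threshold;
    the paper may use a different rule.
    """
    assert len(attn) == height * width
    lo, hi = min(attn), max(attn)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    binary = [1 if (a - lo) / span >= threshold else 0 for a in attn]
    return [binary[r * width:(r + 1) * width] for r in range(height)]


# Toy example: a 2x2 patch grid with 2-dimensional tokens (made-up values).
query = [1.0, 0.0]
patches = [[4.0, 0.0], [0.0, 4.0], [4.0, 1.0], [-2.0, 0.0]]
attn = attention_map(query, patches)
mask = pseudo_mask(attn, height=2, width=2)
```

In this toy case the two patches whose tokens align with the query receive most of the attention and end up as foreground in `mask`. In the weakly-supervised setting described above, such a mask would play the role of the pixel-level target that image-level labels alone cannot provide.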
