DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer

IF 4.9 3区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Cognitive and Developmental Systems Pub Date : 2024-04-02 DOI:10.1109/TCDS.2024.3383952

Sonal Kumar;Arijit Sur;Rashmi Dutta Baruah

{"title":"DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer","authors":"Sonal Kumar;Arijit Sur;Rashmi Dutta Baruah","doi":"10.1109/TCDS.2024.3383952","DOIUrl":null,"url":null,"abstract":"Successive proposals of several self-supervised training schemes (STSs) continue to emerge, taking one step closer to developing a universal foundation model. In this process, unsupervised downstream tasks are recognized as one of the evaluation methods to validate the quality of visual features learned with self-supervised training. However, unsupervised dense semantic segmentation has yet to be explored as a downstream task, which can utilize and evaluate the quality of semantic information introduced in patch-level feature representations during self-supervised training of vision transformers. Therefore, we propose a novel data-driven framework, DatUS, to perform unsupervised dense semantic segmentation (DSS) as a downstream task. DatUS generates semantically consistent pseudosegmentation masks for an unlabeled image dataset without using visual prior or synchronized data. The experiment shows that the proposed framework achieves the highest MIoU (24.90) and average F1 score (36.3) by choosing DINOv2 and the highest pixel accuracy (62.18) by choosing DINO as the STS on the training set of SUIM dataset. It also outperforms state-of-the-art methods for the unsupervised DSS task with 15.02% MIoU, 21.47% pixel accuracy, and 16.06% average F1 score on the validation set of SUIM dataset. It achieves a competitive level of accuracy for a large-scale COCO dataset.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 5","pages":"1775-1788"},"PeriodicalIF":4.9000,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cognitive and Developmental Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10488760/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Successive proposals of several self-supervised training schemes (STSs) continue to emerge, taking one step closer to developing a universal foundation model. In this process, unsupervised downstream tasks are recognized as one of the evaluation methods to validate the quality of visual features learned with self-supervised training. However, unsupervised dense semantic segmentation has yet to be explored as a downstream task, which can utilize and evaluate the quality of semantic information introduced in patch-level feature representations during self-supervised training of vision transformers. Therefore, we propose a novel data-driven framework, DatUS, to perform unsupervised dense semantic segmentation (DSS) as a downstream task. DatUS generates semantically consistent pseudosegmentation masks for an unlabeled image dataset without using visual prior or synchronized data. The experiment shows that the proposed framework achieves the highest MIoU (24.90) and average F1 score (36.3) by choosing DINOv2 and the highest pixel accuracy (62.18) by choosing DINO as the STS on the training set of SUIM dataset. It also outperforms state-of-the-art methods for the unsupervised DSS task with 15.02% MIoU, 21.47% pixel accuracy, and 16.06% average F1 score on the validation set of SUIM dataset. It achieves a competitive level of accuracy for a large-scale COCO dataset.

查看原文本刊更多论文

DatUS：数据驱动的无监督语义分割与预训练的自监督视觉转换器

一些自我监督训练方案（STS）的提案不断涌现，向开发通用基础模型的目标又迈进了一步。在这一过程中，无监督下游任务被认为是验证通过自我监督训练学习到的视觉特征质量的评估方法之一。然而，无监督密集语义分割作为一种下游任务，还有待于探索，它可以利用和评估在视觉转换器的自我监督训练过程中引入到补丁级特征表征中的语义信息的质量。因此，我们提出了一种新颖的数据驱动框架 DatUS，将无监督密集语义分割（DSS）作为一项下游任务来执行。DatUS 无需使用视觉先验数据或同步数据，即可为无标记图像数据集生成语义一致的伪分割掩码。实验结果表明，在 SUIM 数据集的训练集上，通过选择 DINOv2，提议的框架获得了最高的 MIoU（24.90）和平均 F1 分数（36.3）；通过选择 DINO 作为 STS，提议的框架获得了最高的像素准确率（62.18）。在 SUIM 数据集的验证集上，它还以 15.02% 的 MIoU、21.47% 的像素准确率和 16.06% 的平均 F1 得分超越了无监督 DSS 任务的先进方法。对于大规模 COCO 数据集来说，它达到了具有竞争力的准确率水平。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Cognitive and Developmental Systems Computer Science-Software

CiteScore

7.20

自引率

10.00%

发文量

170

期刊介绍： The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.