Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation
Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang
IEEE Transactions on Pattern Analysis and Machine Intelligence
DOI: 10.1109/TPAMI.2024.3432326
Published: 2024-07-23 (Journal Article)
Citations: 0
Abstract
Semantic segmentation with scarce pixel-level annotations has motivated many self-supervised works. Most of them, however, essentially train an image encoder or a segmentation head that produces finer dense representations, and at inference time they must resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from real-time, end-to-end inference practice, but also escalates the problem from segmenting each image to clustering all pixels at once, which degrades performance. To remedy this, we propose a novel self-supervised semantic segmentation training and inference paradigm in which inference is performed end to end. Specifically, based on our observations from probing the dense representations of an image-level self-supervised ViT, namely semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we use adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.
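The per-image, end-to-end inference step built on learnable prototypes can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the use of plain cosine similarity, and the toy dimensions are all assumptions; the actual method uses adaptive prototypes produced by the tailored Transformer decoder.

```python
import numpy as np

def prototype_segmentation(patch_feats, prototypes):
    """Assign each ViT patch to its nearest prototype (illustrative sketch).

    patch_feats: (N, D) dense patch embeddings for one image
    prototypes:  (K, D) prototype vectors, one per segment
    returns:     (N,) per-patch segment labels
    """
    # L2-normalize so the dot product equals cosine similarity
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T  # (N, K) patch-to-prototype similarities
    return sim.argmax(axis=1)

# Toy example: 4 patches and 2 prototypes in a 3-D feature space
feats = np.array([[1.0, 0.1, 0.0],
                  [0.9, 0.2, 0.1],
                  [0.0, 0.1, 1.0],
                  [0.1, 0.0, 0.8]])
protos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
print(prototype_segmentation(feats, protos).tolist())  # → [0, 0, 1, 1]
```

Because each image is segmented against its own set of prototypes rather than by clustering the whole dataset's pixels, inference stays per-image and end to end, which is the practice the abstract contrasts with dataset-level clustering.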