Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation.

Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang
{"title":"Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation.","authors":"Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang","doi":"10.1109/TPAMI.2024.3432326","DOIUrl":null,"url":null,"abstract":"<p><p>The challenge of semantic segmentation with scarce pixel-level annotations has induced many self-supervised works, however most of which essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they need to resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates the real-time and end-to-end inference practice, but also escalates the problem from segmenting per image to clustering all pixels at once, which results in downgraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inferring paradigm where inferring is performed in an end-to-end manner. Specifically, based on our observations in probing dense representation by image-level self-supervised ViT, i.e. semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with attention map constraint to train a tailored Transformer Decoder with learnable prototypes and utilize adaptive prototypes for segmentation inference per image. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and the generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3432326","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The scarcity of pixel-level annotations for semantic segmentation has motivated many self-supervised methods. However, most of them essentially train an image encoder or a segmentation head to produce finer dense representations, and at segmentation inference time they must resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only precludes real-time, end-to-end inference, but also escalates the problem from segmenting each image to clustering all pixels at once, which degrades performance. To remedy this, we propose a novel self-supervised semantic segmentation training and inference paradigm in which inference is performed end to end. Specifically, based on two observations from probing the dense representations of an image-level self-supervised ViT, namely semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we use adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.
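The end-to-end inference the abstract describes can be pictured as nearest-prototype assignment over dense patch features, computed independently for each image. The following is a minimal, hypothetical PyTorch sketch of that idea; the function name `prototype_segmentation`, its arguments, and the cosine-similarity assignment are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def prototype_segmentation(patch_feats: torch.Tensor,
                           prototypes: torch.Tensor,
                           image_hw: tuple[int, int],
                           patch_size: int = 16) -> torch.Tensor:
    """Nearest-prototype assignment over dense ViT patch features.

    patch_feats: (N, D) patch embeddings for ONE image (N = h * w patches)
    prototypes:  (K, D) prototype vectors, one per (pseudo-)class
    image_hw:    original image size (H, W) in pixels
    Returns an (H, W) integer segmentation map.
    """
    # Cosine similarity between every patch and every prototype.
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T  # (N, K)

    # Per-image assignment: each patch takes its most similar prototype,
    # so no dataset-level clustering is needed at inference time.
    labels = sim.argmax(dim=-1)  # (N,)

    # Fold the patch sequence back into a 2-D grid and upsample to pixels.
    # Nearest-neighbor interpolation keeps the labels discrete.
    h, w = image_hw[0] // patch_size, image_hw[1] // patch_size
    seg = labels.view(1, 1, h, w).float()
    seg = F.interpolate(seg, size=image_hw, mode="nearest")
    return seg.squeeze().long()
```

For a 224x224 input with 16x16 patches, `patch_feats` would be a 196xD tensor. Note that in the paper the prototypes are adaptive, produced per image by the tailored Transformer decoder, whereas this sketch treats them as a fixed tensor for brevity.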
