复制-粘贴连贯深度区域改进了城市场景分割的对比学习

BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference Pub Date : 2022-11-25 DOI:10.48550/arXiv.2211.14074

Liang Zeng, A. Lengyel, Nergis Tomen, J. V. Gemert

{"title":"复制-粘贴连贯深度区域改进了城市场景分割的对比学习","authors":"Liang Zeng, A. Lengyel, Nergis Tomen, J. V. Gemert","doi":"10.48550/arXiv.2211.14074","DOIUrl":null,"url":null,"abstract":"In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet, we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available on https://github.com/LeungTsang/CPCDR","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"75 1","pages":"893"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation\",\"authors\":\"Liang Zeng, A. Lengyel, Nergis Tomen, J. V. Gemert\",\"doi\":\"10.48550/arXiv.2211.14074\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet, we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available on https://github.com/LeungTsang/CPCDR\",\"PeriodicalId\":72437,\"journal\":{\"name\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"volume\":\"75 1\",\"pages\":\"893\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2211.14074\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2211.14074","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在这项工作中，我们利用估计的深度来促进城市场景分割的自监督对比学习，其中未标记的视频很容易用于训练自监督深度估计。我们认为，在三维空间中，一个连贯的像素组的语义是自包含的，并且对它们出现的上下文是不变的。我们将连贯的，语义相关的像素分组到给定其估计深度的连贯深度区域，并使用复制粘贴来综合改变其上下文。通过这种方式，在对比学习中建立了跨上下文对应关系，并学习了上下文不变的表示。对于城市场景的无监督语义分割，我们的方法在城市场景的mIoU和KITTI上分别超过了之前最先进的基线+7.14%和+6.65%。对于城市景观和KITTI分割的微调，我们的方法与现有模型相比具有竞争力，然而，我们不需要在ImageNet或COCO上进行预训练，并且我们的计算效率也更高。我们的代码可以在https://github.com/LeungTsang/CPCDR上找到

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation

In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet, we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available on https://github.com/LeungTsang/CPCDR

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference

自引率

0.00%

发文量