Contextual Image Parsing via Panoptic Segment Sorting

Jyh-Jing Hwang, Tsung-Wei Ke, Stella X. Yu
{"title":"Contextual Image Parsing via Panoptic Segment Sorting","authors":"Jyh-Jing Hwang, Tsung-Wei Ke, Stella X. Yu","doi":"10.1145/3476098.3485056","DOIUrl":null,"url":null,"abstract":"Real-world visual recognition is far more complex than object recognition; there is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. While we need context-aware visual recognition, visual context is hard to describe and impossible to label manually. We consider visual context as semantic correlations between objects and their surroundings that include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically. Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects could conflict each other, for two adjacent cars would have the same semantic label but different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results together, we propose a single pixel-wise feature learning approach that unifies both aspects of semantic segmentation and instance segmentation. Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person crossing a somewhat empty street without any such context labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of surround semantics distributions, our retrievals are much more consistent with the query than the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.","PeriodicalId":390904,"journal":{"name":"Multimedia Understanding with Less Labeling on Multimedia Understanding with Less Labeling","volume":"45 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Understanding with Less Labeling on Multimedia Understanding with Less Labeling","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3476098.3485056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Real-world visual recognition is far more complex than object recognition; there is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. While we need context-aware visual recognition, visual context is hard to describe and impossible to label manually. We consider visual context as semantic correlations between objects and their surroundings that include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically. Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects could conflict with each other, for two adjacent cars would have the same semantic label but different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results together, we propose a single pixel-wise feature learning approach that unifies both aspects of semantic segmentation and instance segmentation. Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person crossing a somewhat empty street without any such context labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of surrounding semantic distributions, our retrievals are much more consistent with the query than those of the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.
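To make the pixel-to-segment contrastive idea concrete, below is a minimal sketch, not the authors' released code, of a SegSort-style objective: pixel embeddings are mean-pooled into per-segment prototypes, and each pixel is pulled toward its own segment's prototype and pushed away from all others. The function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# A minimal sketch of SegSort-style pixel-to-segment contrastive learning.
# Assumes PyTorch; shapes, names, and the temperature are illustrative.

import torch
import torch.nn.functional as F


def pixel_segment_contrastive_loss(embeddings, segment_ids, temperature=0.1):
    """embeddings: (N, D) pixel features; segment_ids: (N,) segment index per pixel."""
    embeddings = F.normalize(embeddings, dim=1)            # unit-length pixel features
    num_segments = int(segment_ids.max().item()) + 1

    # Mean-pool pixel features into one prototype per segment.
    prototypes = torch.zeros(num_segments, embeddings.size(1),
                             device=embeddings.device)
    prototypes.index_add_(0, segment_ids, embeddings)
    counts = torch.bincount(segment_ids, minlength=num_segments).clamp(min=1)
    prototypes = F.normalize(prototypes / counts.unsqueeze(1), dim=1)

    # Cosine similarity of every pixel to every segment prototype.
    logits = embeddings @ prototypes.t() / temperature     # (N, num_segments)

    # Each pixel's positive is the prototype of its own segment;
    # all other prototypes act as negatives via softmax cross-entropy.
    return F.cross_entropy(logits, segment_ids)


if __name__ == "__main__":
    feats = torch.randn(6, 8)                  # 6 pixels, 8-dim features
    segs = torch.tensor([0, 0, 1, 1, 2, 2])    # 3 toy segments
    print(pixel_segment_contrastive_loss(feats, segs).item())
```

In this reading, retrieval of contextually similar segments then amounts to nearest-neighbor search over the learned prototype embeddings, which is how context can emerge without explicit context labels.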