PARTS: Unsupervised segmentation with slots, attention and independence maximization

Daniel Zoran, Rishabh Kabra, Alexander Lerchner, Danilo Jimenez Rezende
{"title":"部分:无监督分割槽,注意力和独立性最大化","authors":"Daniel Zoran, Rishabh Kabra, Alexander Lerchner, Danilo Jimenez Rezende","doi":"10.1109/ICCV48922.2021.01027","DOIUrl":null,"url":null,"abstract":"From an early age, humans perceive the visual world as composed of coherent objects with distinctive properties such as shape, size, and color. There is great interest in building models that are able to learn similar structure, ideally in an unsupervised manner. Learning such structure from complex 3D scenes that include clutter, occlusions, interactions, and camera motion is still an open challenge. We present a model that is able to segment visual scenes from complex 3D environments into distinct objects, learn disentangled representations of individual objects, and form consistent and coherent predictions of future frames, in a fully unsupervised manner. Our model (named PARTS) builds on recent approaches that utilize iterative amortized inference and transition dynamics for deep generative models. We achieve dramatic improvements in performance by introducing several novel contributions. We introduce a recurrent slot-attention like encoder which allows for top-down influence during inference. We argue that when inferring scene structure from image sequences it is better to use a fixed prior which is shared across the sequence rather than an auto-regressive prior as often used in prior work. We demonstrate our model’s success on three different video datasets (the popular benchmark CLEVRER; a simulated 3D Playroom environment; and a real-world Robotics Arm dataset). Finally, we analyze the contributions of the various model components and the representations learned by the model.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"16 1","pages":"10419-10427"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"PARTS: Unsupervised segmentation with slots, attention and independence maximization\",\"authors\":\"Daniel Zoran, Rishabh Kabra, Alexander Lerchner, Danilo Jimenez Rezende\",\"doi\":\"10.1109/ICCV48922.2021.01027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"From an early age, humans perceive the visual world as composed of coherent objects with distinctive properties such as shape, size, and color. There is great interest in building models that are able to learn similar structure, ideally in an unsupervised manner. Learning such structure from complex 3D scenes that include clutter, occlusions, interactions, and camera motion is still an open challenge. We present a model that is able to segment visual scenes from complex 3D environments into distinct objects, learn disentangled representations of individual objects, and form consistent and coherent predictions of future frames, in a fully unsupervised manner. Our model (named PARTS) builds on recent approaches that utilize iterative amortized inference and transition dynamics for deep generative models. We achieve dramatic improvements in performance by introducing several novel contributions. We introduce a recurrent slot-attention like encoder which allows for top-down influence during inference. We argue that when inferring scene structure from image sequences it is better to use a fixed prior which is shared across the sequence rather than an auto-regressive prior as often used in prior work. 
We demonstrate our model’s success on three different video datasets (the popular benchmark CLEVRER; a simulated 3D Playroom environment; and a real-world Robotics Arm dataset). Finally, we analyze the contributions of the various model components and the representations learned by the model.\",\"PeriodicalId\":6820,\"journal\":{\"name\":\"2021 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"volume\":\"16 1\",\"pages\":\"10419-10427\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE/CVF International Conference on Computer Vision (ICCV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCV48922.2021.01027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV48922.2021.01027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 34

Abstract

From an early age, humans perceive the visual world as composed of coherent objects with distinctive properties such as shape, size, and color. There is great interest in building models that are able to learn similar structure, ideally in an unsupervised manner. Learning such structure from complex 3D scenes that include clutter, occlusions, interactions, and camera motion is still an open challenge. We present a model that is able to segment visual scenes from complex 3D environments into distinct objects, learn disentangled representations of individual objects, and form consistent and coherent predictions of future frames, in a fully unsupervised manner. Our model (named PARTS) builds on recent approaches that utilize iterative amortized inference and transition dynamics for deep generative models. We achieve dramatic improvements in performance by introducing several novel contributions. We introduce a recurrent slot-attention-like encoder which allows for top-down influence during inference. We argue that when inferring scene structure from image sequences it is better to use a fixed prior which is shared across the sequence rather than an auto-regressive prior as often used in prior work. We demonstrate our model’s success on three different video datasets (the popular benchmark CLEVRER; a simulated 3D Playroom environment; and a real-world Robotics Arm dataset). Finally, we analyze the contributions of the various model components and the representations learned by the model.
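To make the "recurrent slot-attention-like encoder" concrete, the sketch below shows one iterative slot-attention update in the style of Slot Attention (Locatello et al., 2020), on which PARTS builds. All module names, dimensions, and the omission of layer normalization are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal single-step slot-attention sketch (PyTorch).
# Illustrative only: names/dims are assumptions, not the exact PARTS encoder.
import torch
import torch.nn as nn

class SlotAttentionStep(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries from slots
        self.to_k = nn.Linear(dim, dim, bias=False)  # keys from image features
        self.to_v = nn.Linear(dim, dim, bias=False)  # values from image features
        self.gru = nn.GRUCell(dim, dim)              # recurrent slot update

    def forward(self, slots: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, D) slot states; inputs: (B, N, D) flattened features.
        q, k, v = self.to_q(slots), self.to_k(inputs), self.to_v(inputs)
        attn_logits = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
        # Softmax over the slot axis: slots compete for each input location.
        attn = attn_logits.softmax(dim=1)
        attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean per slot
        updates = torch.einsum('bkn,bnd->bkd', attn, v)
        B, K, D = slots.shape
        slots = self.gru(updates.reshape(B * K, D),
                         slots.reshape(B * K, D)).reshape(B, K, D)
        return slots
```

Running this step several times per frame, with the previous frame's slots as initialization, gives the recurrent, top-down behavior the abstract describes: the current slot estimates shape where attention looks next.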
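The claim about priors can likewise be made concrete with the per-frame KL terms of a sequential VAE. This is a hedged sketch of the two options being contrasted, not the paper's exact objective; the function names and the Gaussian parameterization are assumptions.

```python
# Sketch: fixed shared prior vs. auto-regressive prior for latents z_t.
import torch
import torch.distributions as D

def kl_fixed_prior(q_mu: torch.Tensor, q_sigma: torch.Tensor) -> torch.Tensor:
    # Fixed prior shared across the sequence: KL(q(z_t | x_{<=t}) || N(0, I)).
    q = D.Normal(q_mu, q_sigma)
    p = D.Normal(torch.zeros_like(q_mu), torch.ones_like(q_sigma))
    return D.kl_divergence(q, p).sum(-1)

def kl_autoregressive_prior(q_mu, q_sigma, p_mu, p_sigma) -> torch.Tensor:
    # Auto-regressive prior, as in much prior work: p(z_t | z_{<t}) is
    # predicted by a transition model that outputs (p_mu, p_sigma).
    q = D.Normal(q_mu, q_sigma)
    p = D.Normal(p_mu, p_sigma)
    return D.kl_divergence(q, p).sum(-1)
```

Under the fixed prior, the regularizer is identical at every timestep, so temporal structure must be carried by the inference network rather than baked into the prior; the abstract argues this is the better choice when inferring scene structure from image sequences.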