PARTS: Unsupervised segmentation with slots, attention and independence maximization

Daniel Zoran, Rishabh Kabra, Alexander Lerchner, Danilo Jimenez Rezende
{"title":"PARTS: Unsupervised segmentation with slots, attention and independence maximization","authors":"Daniel Zoran, Rishabh Kabra, Alexander Lerchner, Danilo Jimenez Rezende","doi":"10.1109/ICCV48922.2021.01027","DOIUrl":null,"url":null,"abstract":"From an early age, humans perceive the visual world as composed of coherent objects with distinctive properties such as shape, size, and color. There is great interest in building models that are able to learn similar structure, ideally in an unsupervised manner. Learning such structure from complex 3D scenes that include clutter, occlusions, interactions, and camera motion is still an open challenge. We present a model that is able to segment visual scenes from complex 3D environments into distinct objects, learn disentangled representations of individual objects, and form consistent and coherent predictions of future frames, in a fully unsupervised manner. Our model (named PARTS) builds on recent approaches that utilize iterative amortized inference and transition dynamics for deep generative models. We achieve dramatic improvements in performance by introducing several novel contributions. We introduce a recurrent slot-attention like encoder which allows for top-down influence during inference. We argue that when inferring scene structure from image sequences it is better to use a fixed prior which is shared across the sequence rather than an auto-regressive prior as often used in prior work. We demonstrate our model’s success on three different video datasets (the popular benchmark CLEVRER; a simulated 3D Playroom environment; and a real-world Robotics Arm dataset). Finally, we analyze the contributions of the various model components and the representations learned by the model.","PeriodicalId":6820,"journal":{"name":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"16 1","pages":"10419-10427"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE/CVF International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV48922.2021.01027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 34

Abstract

From an early age, humans perceive the visual world as composed of coherent objects with distinctive properties such as shape, size, and color. There is great interest in building models that are able to learn similar structure, ideally in an unsupervised manner. Learning such structure from complex 3D scenes that include clutter, occlusions, interactions, and camera motion is still an open challenge. We present a model that is able to segment visual scenes from complex 3D environments into distinct objects, learn disentangled representations of individual objects, and form consistent and coherent predictions of future frames, in a fully unsupervised manner. Our model (named PARTS) builds on recent approaches that utilize iterative amortized inference and transition dynamics for deep generative models. We achieve dramatic improvements in performance by introducing several novel contributions. We introduce a recurrent slot-attention-like encoder that allows for top-down influence during inference. We argue that when inferring scene structure from image sequences, it is better to use a fixed prior shared across the sequence rather than the auto-regressive prior often used in prior work. We demonstrate our model's success on three different video datasets (the popular benchmark CLEVRER; a simulated 3D Playroom environment; and a real-world Robotics Arm dataset). Finally, we analyze the contributions of the various model components and the representations learned by the model.
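The abstract names two technical ideas: a recurrent, slot-attention-like encoder with top-down influence, and a fixed (rather than auto-regressive) latent prior. Below is a minimal sketch of the first idea, written against the slot-attention update of Locatello et al. (2020), which this line of work builds on. It is illustrative only: the class name, dimensions, and the omission of the usual residual MLP are our simplifications, not the PARTS encoder itself.

```python
import torch
import torch.nn as nn

class SlotAttentionStep(nn.Module):
    """One slot-attention-style update (after Locatello et al., 2020).

    Illustrative sketch, not the PARTS encoder. In a video model, the
    slots returned for frame t can seed the queries for frame t+1,
    which is one way to realize the "top-down influence during
    inference" the abstract describes.
    """

    def __init__(self, slot_dim: int, input_dim: int):
        super().__init__()
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(input_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(input_dim, slot_dim, bias=False)
        self.gru = nn.GRUCell(slot_dim, slot_dim)
        self.norm_slots = nn.LayerNorm(slot_dim)
        self.norm_inputs = nn.LayerNorm(input_dim)
        self.scale = slot_dim ** -0.5

    def forward(self, slots: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
        # slots: (B, K, D_slot) object latents; inputs: (B, N, D_in)
        # flattened per-pixel image features.
        inputs = self.norm_inputs(inputs)
        q = self.to_q(self.norm_slots(slots))            # (B, K, D)
        k, v = self.to_k(inputs), self.to_v(inputs)      # (B, N, D)
        logits = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
        # Softmax over the slot axis: pixels compete for slots, so the
        # attention map doubles as a soft segmentation of the image.
        attn = logits.softmax(dim=1) + 1e-8
        attn = attn / attn.sum(dim=-1, keepdim=True)     # weighted mean
        updates = torch.einsum('bkn,bnd->bkd', attn, v)  # (B, K, D)
        # The GRU makes the update recurrent: current slot states gate
        # how new bottom-up evidence is incorporated.
        new_slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1]))
        return new_slots.reshape(updates.shape)
```

Iterating this step several times per frame, and carrying the slots across frames, yields the kind of iterative amortized inference the abstract refers to.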
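The second idea, the choice of prior, can be stated compactly. Using generic sequential-VAE notation that is ours, not the paper's (latents $z_t$, frames $x_t$, posterior $q$), the per-frame KL term in the training objective differs as follows:

$$
\mathcal{L}^{\mathrm{AR}}_{\mathrm{KL}} = \sum_{t=1}^{T} D_{\mathrm{KL}}\big(q(z_t \mid x_{\le t}) \,\|\, p(z_t \mid z_{t-1})\big)
\quad\text{vs.}\quad
\mathcal{L}^{\mathrm{fixed}}_{\mathrm{KL}} = \sum_{t=1}^{T} D_{\mathrm{KL}}\big(q(z_t \mid x_{\le t}) \,\|\, p(z)\big),
$$

where prior work typically uses the auto-regressive form on the left, and the abstract argues for the fixed, sequence-shared prior $p(z)$ on the right.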