Compositional scene modeling with global object-centric representations

Machine Learning · IF 4.3 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Published 2024-01-11 · DOI: 10.1007/s10994-023-06419-5 · Citations: 0

Abstract

The appearance of the same object may vary across scene images because of occlusions between objects. Humans can quickly identify the same object even when it is occluded, by completing the occluded parts from the complete canonical image of that object held in memory. This ability remains challenging for existing models, especially in the unsupervised learning setting. Inspired by this human ability, we propose a novel object-centric representation learning method that identifies the same, possibly occluded, object across different scenes by learning global object-centric representations of complete canonical objects without supervision. The representation of each object is divided into an extrinsic part, which characterizes scene-dependent information (i.e., position and size), and an intrinsic part, which characterizes globally invariant information (i.e., appearance and shape). The former is inferred with an improved IC-SBP module. The latter is extracted by combining rectangular and arbitrary-shaped attention, and is used to infer an identity representation via a proposed patch-matching strategy against a set of learnable global object-centric representations of complete canonical objects. In the experiments, three 2D scene datasets are used to verify the proposed method's ability to recognize the identity of the same object across different scenes, while a complex 3D scene dataset and a real-world dataset are used to evaluate scene-decomposition performance. The results demonstrate that the proposed method outperforms the comparison methods on both same-object recognition and scene decomposition.
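To make the extrinsic/intrinsic split and the identity-matching idea concrete, here is a minimal PyTorch sketch. It is illustrative only, not the paper's implementation: the dimensions, the cosine-similarity matching rule (a simplified whole-vector stand-in for the paper's patch-matching strategy), and all names (IdentityMatcher, num_canonical, intrinsic_dim) are assumptions.

```python
# Minimal sketch (assumed, not the paper's code): each object latent is split
# into an extrinsic part (position, size) and an intrinsic part (appearance,
# shape); the intrinsic part is matched against a learnable bank of global
# object-centric representations of complete canonical objects.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityMatcher(nn.Module):
    def __init__(self, num_canonical: int = 16, intrinsic_dim: int = 64):
        super().__init__()
        # Learnable global representations, one row per canonical object.
        self.canonical = nn.Parameter(torch.randn(num_canonical, intrinsic_dim))

    def forward(self, intrinsic: torch.Tensor) -> torch.Tensor:
        """Return a soft identity assignment for each object.

        intrinsic: (batch, num_objects, intrinsic_dim)
        returns:   (batch, num_objects, num_canonical)
        """
        # Cosine similarity between inferred intrinsics and the canonical bank.
        sims = F.normalize(intrinsic, dim=-1) @ F.normalize(self.canonical, dim=-1).T
        # Soft assignment; an occluded object can still match its canonical
        # counterpart because the intrinsic part excludes position and size.
        return sims.softmax(dim=-1)

# Usage: split each object latent into extrinsic (here 4 dims, hypothetical)
# and intrinsic parts, then infer identities from the intrinsic part only.
latent = torch.randn(2, 5, 4 + 64)           # (batch, objects, extrinsic+intrinsic)
extrinsic, intrinsic = latent[..., :4], latent[..., 4:]
matcher = IdentityMatcher()
identity = matcher(intrinsic)                 # (2, 5, 16)
```

The soft assignment keeps matching differentiable, so a canonical bank of this kind can be learned jointly with the rest of the model without identity supervision.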

Source journal
Machine Learning (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 11.00
Self-citation rate: 2.70%
Annual articles: 162
Review time: 3 months
Journal description: Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to significant problems and aims to improve the conduct of machine learning research, with a focus on verifiable and replicable evidence in published papers.