{"title":"InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multiobject Representations and Visual Dynamics","authors":"Alireza Rezazadeh;Houjian Yu;Karthik Desingh;Changhyun Choi","doi":"10.1109/TRO.2025.3543274","DOIUrl":null,"url":null,"abstract":"Learning multiobject dynamics purely from visual data is challenging due to the need for robust object representations that can be learned through robot interactions. In previous work (Rezazadeh et al., 2023), we introduced two novel architectures: SlotTransport for discovering object-centric representations from singleview RGB images, referred to as slots, and SlotGNN for predicting scene dynamics from singleview RGB images and robot interactions using the discovered slots. This article introduces InvSlotGNN, a novel framework for learning multiview slot discovery and dynamics that are invariant to the camera viewpoint. First, we demonstrate that SlotTransport can be trained on multiview data such that a single model discovers temporally aligned, object-centric representations from a wide range of different camera angles. These slots bind to objects from various viewpoints, even under occlusion or absence. Next, we introduce InvSlotGNN, an extension of SlotGNN, that learns multiobject dynamics invariant to the camera angle and predicts the future state from observations taken by uncalibrated cameras. InvSlotGNN learns a graph representation of the scene using the slots from SlotTransport and performs relational and spatial reasoning to predict the future state of the scene for arbitrary viewpoints, conditioned on robot actions. We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode visual and positional information. Furthermore, we highlight the accuracy of InvSlotGNN in downstream robotic tasks, including long-horizon prediction and multiobject rearrangement. Finally, with minimal real data, our framework robustly predicts slots and their dynamics in real-world multiview scenarios.","PeriodicalId":50388,"journal":{"name":"IEEE Transactions on Robotics","volume":"41 ","pages":"1812-1824"},"PeriodicalIF":9.4000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Robotics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891822/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0
Abstract
Learning multiobject dynamics purely from visual data is challenging because it requires robust object representations that can be learned through robot interactions. In previous work (Rezazadeh et al., 2023), we introduced two novel architectures: SlotTransport, which discovers object-centric representations, referred to as slots, from single-view RGB images, and SlotGNN, which uses the discovered slots to predict scene dynamics from single-view RGB images and robot interactions. This article introduces InvSlotGNN, a novel framework for learning multiview slot discovery and dynamics that are invariant to the camera viewpoint. First, we demonstrate that SlotTransport can be trained on multiview data such that a single model discovers temporally aligned, object-centric representations from a wide range of camera angles. These slots bind to objects across viewpoints, even when objects are occluded or absent. Next, we introduce InvSlotGNN, an extension of SlotGNN that learns multiobject dynamics invariant to the camera angle and predicts future states from observations taken by uncalibrated cameras. InvSlotGNN builds a graph representation of the scene from the slots discovered by SlotTransport and performs relational and spatial reasoning, conditioned on robot actions, to predict the future state of the scene from arbitrary viewpoints. We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode both visual and positional information. Furthermore, we highlight the accuracy of InvSlotGNN in downstream robotic tasks, including long-horizon prediction and multiobject rearrangement. Finally, with minimal real data, our framework robustly predicts slots and their dynamics in real-world multiview scenarios.
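To make the graph-over-slots idea concrete, the following is a minimal, illustrative sketch of a slot-based GNN dynamics step: each discovered slot becomes a graph node, pairwise messages provide relational reasoning, and a robot action conditions the per-slot update. All module names, dimensions, and the specific message-passing scheme here are assumptions for illustration only, not the authors' InvSlotGNN implementation; see the paper for the actual architecture.

```python
# Hypothetical sketch of one action-conditioned message-passing step over slots.
# Assumes slot features come from a slot-discovery model such as SlotTransport.
import torch
import torch.nn as nn


class SlotDynamicsGNN(nn.Module):
    """One fully connected message-passing step over slot features (illustrative)."""

    def __init__(self, slot_dim: int = 64, action_dim: int = 4, hidden: int = 128):
        super().__init__()
        # Edge model: reasons about pairwise (relational) slot interactions.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * slot_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Node model: updates each slot from aggregated messages and the action.
        self.node_mlp = nn.Sequential(
            nn.Linear(slot_dim + hidden + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, slot_dim),
        )

    def forward(self, slots: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # slots:  (B, K, D) object-centric features at time t
        # action: (B, A)    robot action applied between t and t+1
        B, K, D = slots.shape
        # Build all ordered slot pairs (self-edges included for brevity).
        src = slots.unsqueeze(2).expand(B, K, K, D)  # sender features
        dst = slots.unsqueeze(1).expand(B, K, K, D)  # receiver features
        messages = self.edge_mlp(torch.cat([src, dst], dim=-1))  # (B, K, K, H)
        agg = messages.sum(dim=1)  # aggregate incoming messages per slot: (B, K, H)
        act = action.unsqueeze(1).expand(B, K, -1)  # broadcast action to every slot
        # Residual update: predict slot features at the next timestep.
        return slots + self.node_mlp(torch.cat([slots, agg, act], dim=-1))


# Usage sketch: roll the predictor one step forward from slots at time t.
model = SlotDynamicsGNN()
slots_t = torch.randn(2, 5, 64)      # batch of 2 scenes, 5 slots each
action_t = torch.randn(2, 4)         # e.g., a pushing-action parameterization
slots_t1 = model(slots_t, action_t)  # predicted slots at t+1
```

For long-horizon prediction, a predictor of this shape can be applied autoregressively, feeding each predicted slot set back in with the next action; viewpoint invariance in the paper comes from how the slots themselves are trained, not from this dynamics step.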
Journal description:
The IEEE Transactions on Robotics (T-RO) is dedicated to publishing fundamental papers covering all facets of robotics, drawing on interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, and beyond. From industrial applications to service and personal assistants, and from surgical operations to space, underwater, and remote exploration, robots and intelligent machines play pivotal roles across various domains, including entertainment, safety, search and rescue, military applications, agriculture, and intelligent vehicles.
Special emphasis is placed on intelligent machines and systems designed for unstructured environments, where a significant portion of the environment remains unknown and beyond direct sensing or control.