InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multiobject Representations and Visual Dynamics

Impact Factor: 9.4 · JCR Q1 (Robotics) · CAS Tier 1 (Computer Science)
Alireza Rezazadeh, Houjian Yu, Karthik Desingh, Changhyun Choi
{"title":"InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multiobject Representations and Visual Dynamics","authors":"Alireza Rezazadeh;Houjian Yu;Karthik Desingh;Changhyun Choi","doi":"10.1109/TRO.2025.3543274","DOIUrl":null,"url":null,"abstract":"Learning multiobject dynamics purely from visual data is challenging due to the need for robust object representations that can be learned through robot interactions. In previous work (Rezazadeh et al., 2023), we introduced two novel architectures: SlotTransport for discovering object-centric representations from singleview RGB images, referred to as slots, and SlotGNN for predicting scene dynamics from singleview RGB images and robot interactions using the discovered slots. This article introduces InvSlotGNN, a novel framework for learning multiview slot discovery and dynamics that are invariant to the camera viewpoint. First, we demonstrate that SlotTransport can be trained on multiview data such that a single model discovers temporally aligned, object-centric representations from a wide range of different camera angles. These slots bind to objects from various viewpoints, even under occlusion or absence. Next, we introduce InvSlotGNN, an extension of SlotGNN, that learns multiobject dynamics invariant to the camera angle and predicts the future state from observations taken by uncalibrated cameras. InvSlotGNN learns a graph representation of the scene using the slots from SlotTransport and performs relational and spatial reasoning to predict the future state of the scene for arbitrary viewpoints, conditioned on robot actions. We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode visual and positional information. Furthermore, we highlight the accuracy of InvSlotGNN in downstream robotic tasks, including long-horizon prediction and multiobject rearrangement. 
Finally, with minimal real data, our framework robustly predicts slots and their dynamics in real-world multiview scenarios.","PeriodicalId":50388,"journal":{"name":"IEEE Transactions on Robotics","volume":"41 ","pages":"1812-1824"},"PeriodicalIF":9.4000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Robotics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891822/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ROBOTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Learning multiobject dynamics purely from visual data is challenging due to the need for robust object representations that can be learned through robot interactions. In previous work (Rezazadeh et al., 2023), we introduced two novel architectures: SlotTransport for discovering object-centric representations from single-view RGB images, referred to as slots, and SlotGNN for predicting scene dynamics from single-view RGB images and robot interactions using the discovered slots. This article introduces InvSlotGNN, a novel framework for learning multiview slot discovery and dynamics that are invariant to the camera viewpoint. First, we demonstrate that SlotTransport can be trained on multiview data such that a single model discovers temporally aligned, object-centric representations from a wide range of different camera angles. These slots bind to objects from various viewpoints, even under occlusion or absence. Next, we introduce InvSlotGNN, an extension of SlotGNN that learns multiobject dynamics invariant to the camera angle and predicts the future state from observations taken by uncalibrated cameras. InvSlotGNN learns a graph representation of the scene using the slots from SlotTransport and performs relational and spatial reasoning to predict the future state of the scene for arbitrary viewpoints, conditioned on robot actions. We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode visual and positional information. Furthermore, we highlight the accuracy of InvSlotGNN in downstream robotic tasks, including long-horizon prediction and multiobject rearrangement. Finally, with minimal real data, our framework robustly predicts slots and their dynamics in real-world multiview scenarios.
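The abstract describes a dynamics model that treats the scene as a graph over discovered slots and performs relational reasoning conditioned on robot actions. The following is a minimal illustrative sketch of such a slot-graph dynamics step; all names, dimensions, and the simple tanh/linear layers are assumptions for illustration, not the paper's actual InvSlotGNN architecture.

```python
# Hypothetical sketch of one slot-graph dynamics step: nodes are slot
# features on a fully connected graph; pairwise messages provide relational
# reasoning, and the robot action conditions the per-node update.
# Weights are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
N, D, A = 4, 8, 3  # number of slots, slot feature dim, action dim

W_msg = rng.standard_normal((2 * D, D)) * 0.1      # pairwise message function
W_upd = rng.standard_normal((D + D + A, D)) * 0.1  # action-conditioned update

def gnn_dynamics_step(slots: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Predict next-step slot features from current slots and a robot action."""
    msgs = np.zeros_like(slots)
    for i in range(N):            # aggregate messages from every other slot
        for j in range(N):
            if i == j:
                continue
            pair = np.concatenate([slots[i], slots[j]])
            msgs[i] += np.tanh(pair @ W_msg)
    # Node update conditioned on the action; residual keeps dynamics stable.
    inp = np.concatenate([slots, msgs, np.tile(action, (N, 1))], axis=1)
    return slots + np.tanh(inp @ W_upd)

slots = rng.standard_normal((N, D))
next_slots = gnn_dynamics_step(slots, np.zeros(A))
print(next_slots.shape)  # (4, 8)
```

In this kind of rollout, the predicted slots are fed back in as the next input, which is how long-horizon prediction over action sequences would proceed.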
Journal: IEEE Transactions on Robotics (Engineering & Technology – Robotics)
CiteScore: 14.90
Self-citation rate: 5.10%
Annual articles: 259
Review time: 6.0 months
Journal description: The IEEE Transactions on Robotics (T-RO) is dedicated to publishing fundamental papers covering all facets of robotics, drawing on interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, and beyond. From industrial applications to service and personal assistants, surgical operations to space, underwater, and remote exploration, robots and intelligent machines play pivotal roles across various domains, including entertainment, safety, search and rescue, military applications, agriculture, and intelligent vehicles. Special emphasis is placed on intelligent machines and systems designed for unstructured environments, where a significant portion of the environment remains unknown and beyond direct sensing or control.