ASIMO: Agent-centric scene representation in multi-object manipulation

The International Journal of Robotics Research. Pub Date: 2024-06-10. DOI: 10.1177/02783649241257537

Cheol-Hui Min, Young Min Kim

Citations: 0

Abstract

ASIMO: Agent-centric scene representation in multi-object manipulation

Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic of specific hardware configurations. As visual observations are highly entangled, attempts for vision-based RL rely on scene representation that discerns individual entities and establishes intuitive physics to constitute the world model. However, most existing works on scene representation learning cannot successfully be deployed to train an RL agent, as they are often highly unstable and fail to sustain for a long enough temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition to perform interaction-rich tasks with a vision-based RL agent. ASIMO decomposes agent-object interaction videos of episodic-length into the agent, objects, and background, predicting their long-term interactions. Further, we explicitly model possible occlusion in the image observations and stably track individual objects. Then, we can correctly deduce the updated positions of individual entities in response to the agent action, only from partial visual observation. Based on the stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework to train the RL agent that focuses on the context around the object of interest. We demonstrate that our formulation for scene representation can be universally deployed to train different configurations of agents and accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.
