{"title":"ASIMO:多物体操纵中以代理为中心的场景表示法","authors":"Cheol-Hui Min, Young Min Kim","doi":"10.1177/02783649241257537","DOIUrl":null,"url":null,"abstract":"Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic of specific hardware configurations. As visual observations are highly entangled, attempts for vision-based RL rely on scene representation that discerns individual entities and establishes intuitive physics to constitute the world model. However, most existing works on scene representation learning cannot successfully be deployed to train an RL agent, as they are often highly unstable and fail to sustain for a long enough temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition to perform interaction-rich tasks with a vision-based RL agent. ASIMO decomposes agent-object interaction videos of episodic-length into the agent, objects, and background, predicting their long-term interactions. Further, we explicitly model possible occlusion in the image observations and stably track individual objects. Then, we can correctly deduce the updated positions of individual entities in response to the agent action, only from partial visual observation. Based on the stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework to train the RL agent that focuses on the context around the object of interest. We demonstrate that our formulation for scene representation can be universally deployed to train different configurations of agents and accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.","PeriodicalId":501362,"journal":{"name":"The International Journal of Robotics Research","volume":" 8","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ASIMO: Agent-centric scene representation in multi-object manipulation\",\"authors\":\"Cheol-Hui Min, Young Min Kim\",\"doi\":\"10.1177/02783649241257537\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic of specific hardware configurations. As visual observations are highly entangled, attempts for vision-based RL rely on scene representation that discerns individual entities and establishes intuitive physics to constitute the world model. However, most existing works on scene representation learning cannot successfully be deployed to train an RL agent, as they are often highly unstable and fail to sustain for a long enough temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition to perform interaction-rich tasks with a vision-based RL agent. ASIMO decomposes agent-object interaction videos of episodic-length into the agent, objects, and background, predicting their long-term interactions. Further, we explicitly model possible occlusion in the image observations and stably track individual objects. Then, we can correctly deduce the updated positions of individual entities in response to the agent action, only from partial visual observation. Based on the stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework to train the RL agent that focuses on the context around the object of interest. 
We demonstrate that our formulation for scene representation can be universally deployed to train different configurations of agents and accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.\",\"PeriodicalId\":501362,\"journal\":{\"name\":\"The International Journal of Robotics Research\",\"volume\":\" 8\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The International Journal of Robotics Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1177/02783649241257537\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The International Journal of Robotics Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/02783649241257537","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Vision-based reinforcement learning (RL) is a generalizable way to control an agent because it is agnostic to specific hardware configurations. Because visual observations are highly entangled, vision-based RL approaches rely on scene representations that discern individual entities and capture intuitive physics to constitute a world model. However, most existing scene representation learning methods cannot be deployed to train an RL agent: they are often highly unstable and fail to remain consistent over a long enough temporal horizon. We propose ASIMO, a fully unsupervised scene decomposition method for performing interaction-rich tasks with a vision-based RL agent. ASIMO decomposes episode-length agent-object interaction videos into the agent, objects, and background, and predicts their long-term interactions. Furthermore, we explicitly model possible occlusions in the image observations and stably track individual objects, so that the updated positions of individual entities in response to an agent action can be correctly deduced from partial visual observations alone. Building on this stable entity-wise decomposition and temporal prediction, we formulate a hierarchical framework that trains the RL agent to focus on the context around the object of interest. We demonstrate that our scene representation can be deployed to train agents of different configurations and to accomplish several tasks that involve pushing, arranging, and placing multiple rigid objects.
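
For illustration only, here is a minimal, hypothetical PyTorch sketch of the kind of pipeline the abstract describes: an entity-wise encoder splits an image observation into agent, object, and background slots, and a two-level policy focuses on the slot of interest. All module names, dimensions, and the soft-selection scheme are assumptions made for this sketch, not the authors' ASIMO implementation.

```python
# Hypothetical sketch (not the authors' code): an entity-wise encoder feeds a
# two-level policy, mirroring the abstract's "decompose, then focus on the
# context around the object of interest" idea. All names are illustrative.
import torch
import torch.nn as nn

class EntityEncoder(nn.Module):
    """Maps an image observation to K entity slots (agent, objects, background)."""
    def __init__(self, num_slots=4, slot_dim=32):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_slots * slot_dim),
        )

    def forward(self, image):  # image: (B, 3, H, W)
        slots = self.backbone(image)
        return slots.view(-1, self.num_slots, self.slot_dim)  # (B, K, D)

class HierarchicalPolicy(nn.Module):
    """High level softly selects the object of interest; low level outputs an
    action conditioned on the agent slot and the selected object slot."""
    def __init__(self, slot_dim=32, action_dim=4):
        super().__init__()
        self.selector = nn.Linear(slot_dim, 1)  # scores each non-agent slot
        self.controller = nn.Sequential(
            nn.Linear(2 * slot_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, slots):  # slots: (B, K, D); convention: slot 0 is the agent
        scores = self.selector(slots[:, 1:]).squeeze(-1)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        focus = (weights * slots[:, 1:]).sum(dim=1)  # soft "object of interest"
        return self.controller(torch.cat([slots[:, 0], focus], dim=-1))

# Toy forward pass on a random observation.
encoder, policy = EntityEncoder(), HierarchicalPolicy()
obs = torch.rand(1, 3, 64, 64)
action = policy(encoder(obs))
print(action.shape)  # torch.Size([1, 4])
```

In the paper's actual framework the decomposition is learned without supervision and includes occlusion-aware tracking and temporal prediction; the sketch only illustrates how an entity-wise representation can feed a hierarchical controller.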