T3VIP: Transformation-based $3\mathrm{D}$ Video Prediction

Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghunandan Rajan, F. Hutter, Wolfram Burgard
{"title":"T3VIP:基于变换的$3\\ mathm {D}$视频预测","authors":"Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghunandan Rajan, F. Hutter, Wolfram Burgard","doi":"10.1109/IROS47612.2022.9981187","DOIUrl":null,"url":null,"abstract":"For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at http://t3vip.cs.uni-freiburg.de.","PeriodicalId":431373,"journal":{"name":"2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"T3VIP: Transformation-based $3\\\\mathrm{D}$ Video Prediction\",\"authors\":\"Iman Nematollahi, Erick Rosete-Beas, Seyed Mahdi B. Azad, Raghunandan Rajan, F. Hutter, Wolfram Burgard\",\"doi\":\"10.1109/IROS47612.2022.9981187\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For autonomous skill acquisition, robots have to learn about the physical rules governing the 3D world dynamics from their own past experience to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to interpret the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation with simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. 
Videos, code, dataset, and pre-trained models are available at http://t3vip.cs.uni-freiburg.de.\",\"PeriodicalId\":431373,\"journal\":{\"name\":\"2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IROS47612.2022.9981187\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IROS47612.2022.9981187","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

For autonomous skill acquisition, robots have to learn the physical rules governing 3D world dynamics from their own past experience in order to predict and reason about plausible future outcomes. To this end, we propose a transformation-based 3D video prediction (T3VIP) approach that explicitly models 3D motion by decomposing a scene into its object parts and predicting their corresponding rigid transformations. Our model is fully unsupervised, captures the stochastic nature of the real world, and uses observational cues in the image and point cloud domains as its learning signals. To fully leverage all the 2D and 3D observational signals, we equip our model with automatic hyperparameter optimization (HPO) to determine the best way of learning from them. To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera. Our extensive evaluation on simulated and real-world datasets demonstrates that our formulation leads to interpretable 3D models that predict future depth videos while achieving on-par performance with 2D models on RGB video prediction. Moreover, we demonstrate that our model outperforms 2D baselines on visuomotor control. Videos, code, dataset, and pre-trained models are available at http://t3vip.cs.uni-freiburg.de.
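The core idea in the abstract, decomposing a scene into object parts and predicting a rigid transformation per part, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example of warping a point cloud by blending per-part rigid motions with soft segmentation masks; the function name, tensor shapes, and blending scheme are assumptions made for illustration, not the authors' implementation.

```python
import torch

def predict_next_point_cloud(points, masks, rotations, translations):
    """Warp a point cloud one step into the future by blending per-part
    rigid transformations. Illustrative sketch only: names, shapes, and
    the soft-mask blending are assumptions, not the paper's code.

    points:       (N, 3)    current point cloud
    masks:        (K, N)    soft assignment of each point to K object parts
    rotations:    (K, 3, 3) predicted rotation matrix per part
    translations: (K, 3)    predicted translation vector per part
    """
    # Apply every part's rigid transform to all points: (K, N, 3)
    moved = torch.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # Blend the K candidate motions with the soft masks: (N, 3)
    return torch.einsum('kn,kni->ni', masks, moved)

# Tiny usage example with two parts and softmax-normalized masks.
N, K = 1024, 2
points = torch.randn(N, 3)
masks = torch.softmax(torch.randn(K, N), dim=0)   # each point's weights sum to 1
rotations = torch.eye(3).expand(K, 3, 3)          # identity rotations for the demo
translations = torch.tensor([[0.1, 0.0, 0.0],     # part 0 slides along x
                             [0.0, 0.0, 0.0]])    # part 1 stays put
next_points = predict_next_point_cloud(points, masks, rotations, translations)
```

Rendering the warped point cloud back into image space would then yield the predicted RGB-D frame, and the 2D (image) and 3D (point cloud) reconstruction errors would provide the learning signals whose relative weighting the paper's HPO step is meant to resolve.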