Learning Object-Centric Dynamic Modes from Video and Emerging Properties.

Armand Comas Massague, Christian Fernandez-Lopez, Sandesh Ghimire, Haolin Li, Mario Sznaier, Octavia Camps
{"title":"Learning Object-Centric Dynamic Modes from Video and Emerging Properties.","authors":"Armand Comas Massague, Christian Fernandez-Lopez, Sandesh Ghimire, Haolin Li, Mario Sznaier, Octavia Camps","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>One of the long-term objectives of Machine Learning is to endow machines with the capacity of structuring and interpreting the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their composing objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts in learning parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such decomposition can be used for instance to perform video analytics, predict future frames or generate synthetic video. We test our framework in a variety of datasets that encompass different dynamic scenarios, while illustrating the novel features that emerge from our dynamic modes decomposition: Video dynamics interpretation and user manipulation at test-time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"5 ","pages":"745-769"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395393/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

One of the long-term objectives of Machine Learning is to endow machines with the capacity to structure and interpret the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts in learning parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such a decomposition can be used, for instance, to perform video analytics, predict future frames or generate synthetic video. We test our framework on a variety of datasets that encompass different dynamic scenarios, while illustrating the novel capabilities that emerge from our dynamic mode decomposition: video dynamics interpretation and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.
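To make the role of the Koopman operator concrete, the sketch below (our illustration, not the authors' implementation) fits a linear operator to latent object trajectories by least squares, in the style of dynamic mode decomposition, and reads off its eigenvalues and eigenvectors as the dynamic modes. The latent states, the dynamic_modes helper, and the toy rotating trajectory are assumptions made only for this example; in the paper the latents would come from a learned object-centric encoder.

import numpy as np

def dynamic_modes(latents):
    """latents: (T, d) array of latent states z_1..z_T for one object."""
    Z_past, Z_next = latents[:-1].T, latents[1:].T   # (d, T-1) snapshot pairs
    # Least-squares Koopman/DMD estimate: z_{t+1} ~= K z_t
    K = Z_next @ np.linalg.pinv(Z_past)
    eigvals, eigvecs = np.linalg.eig(K)              # dynamic modes of the fitted operator
    return K, eigvals, eigvecs

# Toy usage: a 2-D latent rotating at a fixed angular frequency.
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z = np.zeros((100, 2))
z[0] = [1.0, 0.0]
for t in range(99):
    z[t + 1] = R @ z[t]

K, eigvals, eigvecs = dynamic_modes(z)
print(np.angle(eigvals))   # approximately +-0.1: the rotation frequency shows up in the eigenvalues

The point of the toy example is that the dynamics of the trajectory (here a single rotation frequency) are recovered directly from the spectrum of the fitted operator, which is what makes the eigenvalue/eigenvector representation interpretable and parsimonious.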
