Learning Object-Centric Dynamic Modes from Video and Emerging Properties.

Armand Comas Massague, Christian Fernandez-Lopez, Sandesh Ghimire, Haolin Li, Mario Sznaier, Octavia Camps
{"title":"Learning Object-Centric Dynamic Modes from Video and Emerging Properties.","authors":"Armand Comas Massague, Christian Fernandez-Lopez, Sandesh Ghimire, Haolin Li, Mario Sznaier, Octavia Camps","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>One of the long-term objectives of Machine Learning is to endow machines with the capacity of structuring and interpreting the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their composing objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts in learning parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such decomposition can be used for instance to perform video analytics, predict future frames or generate synthetic video. We test our framework in a variety of datasets that encompass different dynamic scenarios, while illustrating the novel features that emerge from our dynamic modes decomposition: Video dynamics interpretation and user manipulation at test-time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"5 ","pages":"745-769"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12395393/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

One of the long-term objectives of Machine Learning is to endow machines with the capacity to structure and interpret the world as we do. This is particularly challenging in scenes involving time series, such as video sequences, since seemingly different data can correspond to the same underlying dynamics. Recent approaches seek to decompose video sequences into their constituent objects, attributes and dynamics in a self-supervised fashion, thus simplifying the task of learning suitable features that can be used to analyze each component. While existing methods can successfully disentangle dynamics from other components, there have been relatively few efforts in learning parsimonious representations of these underlying dynamics. In this paper, motivated by recent advances in non-linear identification, we propose a method to decompose a video into moving objects, their attributes and the dynamic modes of their trajectories. We model video dynamics as the output of a Koopman operator to be learned from the available data. In this context, the dynamic information contained in the scene is encapsulated in the eigenvalues and eigenvectors of the Koopman operator, providing an interpretable and parsimonious representation. We show that such a decomposition can be used, for instance, to perform video analytics, predict future frames or generate synthetic video. We test our framework on a variety of datasets that encompass different dynamic scenarios, while illustrating the novel capabilities that emerge from our dynamic mode decomposition: video dynamics interpretation and user manipulation at test time. We successfully forecast challenging object trajectories from pixels, achieving competitive performance while drawing useful insights.
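To make the role of the Koopman operator concrete, the sketch below (our illustration, not the authors' implementation) fits a linear operator to latent object trajectories by least squares, in the style of dynamic mode decomposition, and reads off its eigenvalues and eigenvectors as the dynamic modes. The latent states, the dynamic_modes helper, and the toy rotating trajectory are assumptions made only for this example; in the paper the latents would come from a learned object-centric encoder.

import numpy as np

def dynamic_modes(latents):
    """latents: (T, d) array of latent states z_1..z_T for one object."""
    Z_past, Z_next = latents[:-1].T, latents[1:].T   # (d, T-1) snapshot pairs
    # Least-squares Koopman/DMD estimate: z_{t+1} ~= K z_t
    K = Z_next @ np.linalg.pinv(Z_past)
    eigvals, eigvecs = np.linalg.eig(K)              # dynamic modes of the fitted operator
    return K, eigvals, eigvecs

# Toy usage: a 2-D latent rotating at a fixed angular frequency.
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z = np.zeros((100, 2))
z[0] = [1.0, 0.0]
for t in range(99):
    z[t + 1] = R @ z[t]

K, eigvals, eigvecs = dynamic_modes(z)
print(np.angle(eigvals))   # approximately +-0.1: the rotation frequency shows up in the eigenvalues

The point of the toy example is that the dynamics of the trajectory (here a single rotation frequency) are recovered directly from the spectrum of the fitted operator, which is what makes the eigenvalue/eigenvector representation interpretable and parsimonious.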
