Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers.

IF 10.2; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE transactions on neural networks and learning systems Pub Date : 2025-07-22 DOI:10.1109/tnnls.2025.3585949

Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed

Citations: 0

Abstract

Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments using probing models to identify network regions that encode information useful for accurately estimating partial differential equation (PDE) simulation parameters, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
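As a rough illustration of two ideas named in the abstract, the sketch below builds one possible spatiotemporal attention layout (a frame-level causal mask) and runs an autoregressive rollout in pixel space. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the layouts compared, and `frame_causal_mask`, `rollout`, and `constant_velocity` are hypothetical names, with `constant_velocity` standing in for a trained model.

```python
from typing import Callable, List

Frame = List[float]  # one frame flattened to pixel intensities in [0, 1]


def frame_causal_mask(num_frames: int, patches_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout: tokens are ordered frame by frame; a token sees every
    patch of its own frame and of all earlier frames, and no later frame.
    """
    n = num_frames * patches_per_frame
    frame_of = [i // patches_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


def rollout(predict_next: Callable[[List[Frame]], Frame],
            context: List[Frame], steps: int) -> List[Frame]:
    """Autoregressive rollout: each predicted frame is appended to the
    history and fed back in as input for the next prediction."""
    frames = list(context)
    for _ in range(steps):
        frames.append(predict_next(frames))
    return frames[len(context):]


def constant_velocity(frames: List[Frame]) -> Frame:
    """Hypothetical stand-in for a trained model: extrapolate each pixel
    linearly from the last two frames, clipped to [0, 1]."""
    prev, last = frames[-2], frames[-1]
    return [min(1.0, max(0.0, 2 * b - a)) for a, b in zip(prev, last)]


if __name__ == "__main__":
    for row in frame_causal_mask(num_frames=2, patches_per_frame=2):
        print(row)
    print(rollout(constant_velocity, [[0.0, 0.5], [0.25, 0.5]], steps=3))
```

Because every predicted frame becomes input to the next step, errors compound over the rollout, which is why the length of the horizon over which predictions stay physically accurate is a natural quantity to measure.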

Source journal

IEEE transactions on neural networks and learning systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)

CiteScore: 23.80

Self-citation rate: 9.60%

Articles published: 2102

Review time: 3-8 weeks

Journal description: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.