A masked autoencoder network for spatiotemporal predictive learning
Fengzhen Sun, Weidong Jin
Applied Intelligence, vol. 55, no. 5, published 2025-01-20. DOI: 10.1007/s10489-024-06214-2
This paper addresses predictive learning: generating future frames from previously observed images. Existing RNN- and CNN-based methods suffer from the vanishing-gradient problem and cannot capture long-term dependencies effectively. To overcome this, we present MastNet, a spatiotemporal framework for long-term predictive learning. We design a hierarchical Transformer-based encoder-decoder whose blocks use spatiotemporal-window self-attention to reduce computational complexity, combined with a spatiotemporal shifted-window partitioning approach. More importantly, we build a spatiotemporal autoencoder using a random clip-mask strategy, which improves feature mining for temporal dependencies and spatial correlations. We also insert an auxiliary prediction head that helps the model generate higher-quality frames. Experiments on two spatiotemporal datasets show that MastNet outperforms state-of-the-art models in accuracy and long-term prediction.
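To make the random clip-mask idea concrete, the sketch below masks a random subset of spatiotemporal patches ("clips") in a video tensor before encoding, in the spirit of masked autoencoding. The function name, patch sizes, and mask ratio are illustrative assumptions; the paper's exact masking scheme may differ.

```python
import numpy as np

def random_clip_mask(video, patch=(2, 8, 8), mask_ratio=0.5, rng=None):
    """Zero out a random subset of spatiotemporal patches.

    video: array of shape (T, H, W); each patch dimension must divide
    the corresponding video dimension. Hypothetical illustration of a
    random clip-mask strategy, not the paper's exact algorithm.
    Returns the masked video and the boolean visibility mask
    (True = kept, False = masked out).
    """
    rng = rng or np.random.default_rng(0)
    T, H, W = video.shape
    pt, ph, pw = patch
    nt, nh, nw = T // pt, H // ph, W // pw
    n_patches = nt * nh * nw
    n_masked = int(mask_ratio * n_patches)
    # Choose which patches to hide, uniformly at random.
    hidden = rng.choice(n_patches, size=n_masked, replace=False)
    visible = np.ones(n_patches, dtype=bool)
    visible[hidden] = False
    visible = visible.reshape(nt, nh, nw)
    # Upsample the patch-level mask to pixel resolution.
    full = np.repeat(np.repeat(np.repeat(visible, pt, 0), ph, 1), pw, 2)
    return video * full, full

video = np.ones((4, 16, 16))
masked, mask = random_clip_mask(video, mask_ratio=0.5)
```

During training, the autoencoder would reconstruct the hidden patches from the visible ones, forcing the encoder to learn both temporal dependencies (across the T axis) and spatial correlations (across H and W).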
Journal overview:
With a focus on research in artificial intelligence and neural networks, this journal addresses real-life manufacturing, defense, management, government, and industrial problems that are too complex to be solved through conventional approaches and instead require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.