Scalable video transformer for full-frame video prediction

IF 4.3 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding Pub Date : 2024-09-13 DOI:10.1016/j.cviu.2024.104166

Zhan Li, Feng Liu

{"title":"Scalable video transformer for full-frame video prediction","authors":"Zhan Li, Feng Liu","doi":"10.1016/j.cviu.2024.104166","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformers (ViTs) have shown success in many low-level computer vision tasks. However, existing ViT models are limited by their high computation and memory cost when generating high-resolution videos for tasks like video prediction. This paper presents a scalable video transformer for full-frame video predication. Specifically, we design a backbone transformer block for our video transformer. This transformer block decouples the temporal and channel features to reduce the computation cost when processing large-scale spatial–temporal video features. We use transposed attention to focus on the channel dimension instead of the spatial window to further reduce the computation cost. We also design a Global Shifted Multi-Dconv Head Transposed Attention module (GSMDTA) for our transformer block. This module is built upon two key ideas. First, we design a depth shift module to better incorporate the cross-channel or temporal information from video features. Second, we introduce a global query mechanism to capture global information to handle large motion for video prediction. This new transformer block enables our video transformer to predict a full frame from multiple past frames at the resolution of 1024 × 512 with 12 GB VRAM. Experiments on various video prediction benchmarks demonstrate that our method with only RGB input outperforms state-of-the-art methods that require additional data, like segmentation maps and optical flows. Our method exceeds the state-of-the-art RGB-only methods by a large margin (1.2 dB) in PSNR. Our method is also faster than state-of-the-art video prediction transformers.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104166"},"PeriodicalIF":4.3000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314224002479","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Vision Transformers (ViTs) have shown success in many low-level computer vision tasks. However, existing ViT models are limited by their high computation and memory cost when generating high-resolution videos for tasks like video prediction. This paper presents a scalable video transformer for full-frame video predication. Specifically, we design a backbone transformer block for our video transformer. This transformer block decouples the temporal and channel features to reduce the computation cost when processing large-scale spatial–temporal video features. We use transposed attention to focus on the channel dimension instead of the spatial window to further reduce the computation cost. We also design a Global Shifted Multi-Dconv Head Transposed Attention module (GSMDTA) for our transformer block. This module is built upon two key ideas. First, we design a depth shift module to better incorporate the cross-channel or temporal information from video features. Second, we introduce a global query mechanism to capture global information to handle large motion for video prediction. This new transformer block enables our video transformer to predict a full frame from multiple past frames at the resolution of 1024 × 512 with 12 GB VRAM. Experiments on various video prediction benchmarks demonstrate that our method with only RGB input outperforms state-of-the-art methods that require additional data, like segmentation maps and optical flows. Our method exceeds the state-of-the-art RGB-only methods by a large margin (1.2 dB) in PSNR. Our method is also faster than state-of-the-art video prediction transformers.

查看原文本刊更多论文

用于全帧视频预测的可扩展视频变换器

视觉变换器（ViT）在许多低级计算机视觉任务中都取得了成功。然而，在为视频预测等任务生成高分辨率视频时，现有的 ViT 模型受限于其较高的计算和内存成本。本文提出了一种用于全帧视频预测的可扩展视频变换器。具体来说，我们为视频转换器设计了一个骨干转换器块。在处理大规模时空视频特征时，该转换器块将时间特征和通道特征分离开来，从而降低了计算成本。我们使用转置注意力来关注信道维度，而不是空间窗口，从而进一步降低计算成本。我们还为转换器模块设计了全局偏移多视角转置注意力模块（GSMDTA）。该模块基于两个关键理念。首先，我们设计了一个深度偏移模块，以更好地整合视频特征中的跨信道或时间信息。其次，我们引入了全局查询机制，以捕捉全局信息来处理视频预测中的大运动。这个新的转换器模块使我们的视频转换器能够在 1024 × 512 分辨率和 12 GB VRAM 的条件下，从多个过去的帧中预测一个完整的帧。在各种视频预测基准上进行的实验表明，我们的方法在仅使用 RGB 输入的情况下，优于需要额外数据（如分割图和光流）的先进方法。我们的方法在 PSNR 方面大大超过了最先进的纯 RGB 方法（1.2 dB）。我们的方法也比最先进的视频预测变换器更快。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer Vision and Image Understanding 工程技术-工程：电子与电气

CiteScore

7.80

自引率

4.40%

发文量

112

审稿时长

79 days

期刊介绍： The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems