VMG: Rethinking U-Net Architecture for Video Super-Resolution

IF 4.8 | Zone 1, Computer Science | JCR Q2, Engineering, Electrical & Electronic

IEEE Transactions on Broadcasting | Pub Date: 2024-11-21 | DOI: 10.1109/TBC.2024.3486967

Jun Tang;Lele Niu;Linlin Liu;Hang Dai;Yong Ding

Volume 71, Issue 1, pages 334-349. IEEE Xplore: https://ieeexplore.ieee.org/document/10762902/

Citations: 0

Abstract

The U-Net architecture has exhibited significant efficacy across various vision tasks, yet its adaptation for Video Super-Resolution (VSR) remains underexplored. While the Video Restoration Transformer (VRT) introduced U-Net into the VSR domain, it poses challenges due to intricate design and substantial computational overhead. In this paper, we present VMG, a streamlined framework tailored for VSR. Through empirical analysis, we identify the crucial stages of the U-Net architecture contributing to performance enhancement in VSR tasks. Our optimized architecture substantially reduces model parameters and complexity while improving performance. Additionally, we introduce two key modules, namely the Gated MLP-like Mixer (GMM) and the Flow-Guided cross-attention Mixer (FGM), designed to enhance spatial and temporal feature aggregation. GMM dynamically encodes spatial correlations with linear complexity in space and time, and FGM leverages optical flow to capture motion variation and implement sparse attention to efficiently aggregate temporally related information. Extensive experiments demonstrate that VMG achieves nearly 70% reduction in GPU memory usage, 30% fewer parameters, and 10% lower computational complexity (FLOPs) compared to VRT, while yielding highly competitive or superior results across four benchmark datasets. Qualitative assessments reveal VMG’s ability to preserve remarkable details and sharp structures in the reconstructed videos. The code and pre-trained models are available at https://github.com/EasyVision-Ton/VMG.
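The abstract describes FGM only at a high level: optical flow locates motion, and sparse attention aggregates temporally related features. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a single neighboring frame, a dense flow field stored as (dy, dx) offsets, nearest-pixel rounding, and a square attention window; the function name `flow_guided_sparse_attention` and the `radius` parameter are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def flow_guided_sparse_attention(query, neighbor, flow, radius=1):
    """Illustrative flow-guided sparse cross-attention (hypothetical helper).

    query, neighbor: (H, W, C) feature maps of the current and a neighboring frame.
    flow: (H, W, 2) displacements (dy, dx) from the current frame to the neighbor.
    Each query pixel attends only to a (2*radius+1)^2 window of the neighbor,
    centred on the location its flow vector points to.
    """
    H, W, C = query.shape
    out = np.zeros_like(query)
    for y in range(H):
        for x in range(W):
            # Flow-predicted position in the neighbor frame, rounded and clamped.
            cy = int(np.clip(np.rint(y + flow[y, x, 0]), 0, H - 1))
            cx = int(np.clip(np.rint(x + flow[y, x, 1]), 0, W - 1))
            # Local window around that position, cropped at the image border.
            y0, y1 = max(cy - radius, 0), min(cy + radius, H - 1) + 1
            x0, x1 = max(cx - radius, 0), min(cx + radius, W - 1) + 1
            keys = neighbor[y0:y1, x0:x1].reshape(-1, C)
            # Scaled dot-product scores and a numerically stable softmax.
            scores = keys @ query[y, x] / np.sqrt(C)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            # Keys double as values in this simplified sketch.
            out[y, x] = weights @ keys
    return out
```

In this sketch each query compares against at most (2·radius+1)² keys, so the cost grows linearly with the number of pixels, whereas dense cross-attention between two frames grows quadratically. A practical version would be vectorized, use learned query/key/value projections, and sample at sub-pixel positions.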


Source journal

IEEE Transactions on Broadcasting (Engineering & Technology: Telecommunications)

CiteScore: 9.40

Self-citation rate: 31.10%

Review time: 6-12 weeks

Journal description: The Society’s Field of Interest is “Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects.” In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that “broadcast systems includes all aspects of transmission, propagation, and reception.”