Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2024-08-27. DOI: 10.1109/TPAMI.2024.3450630

Long Zhuo; Guangcong Wang; Shikai Li; Wayne Wu; Ziwei Liu

Volume 46, Issue 12, pp. 10732-10747. Publication type: Journal Article. Source: https://ieeexplore.ieee.org/document/10652893/

Citations: 0

Abstract

Video-to-video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch, and pose maps. However, this pipeline is heavily limited by high computational cost and long inference latency, which are mainly attributed to two factors: 1) the network architecture parameters and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, because it does not account for temporal coherence, image-based compression is not sufficient for compressing video tasks. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation of the teacher network and on the data stream of generative models in both space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and temporal coherence knowledge in order to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains. We also transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU. Code and models are publicly available.
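The inference idea described in the abstract (run the generator only on spatially compressed key frames, then fill in the intermediate frames) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the names `downsample`, `keyframe_indices`, `blend`, and `synthesize` are hypothetical, the `generator` argument stands in for a trained student network, and plain linear blending is substituted for the paper's motion compensation. The distillation training itself is not shown, and the sketch assumes at least one input frame.

```python
from typing import Callable, List

Frame = List[List[float]]  # a single-channel image as rows of pixel values


def downsample(frame: Frame, factor: int) -> Frame:
    """Average-pool a frame by `factor` (height and width must divide evenly)."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def keyframe_indices(num_frames: int, stride: int) -> List[int]:
    """Every `stride`-th frame, plus the last frame so each gap has two endpoints."""
    idx = list(range(0, num_frames, stride))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)
    return idx


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear blend of two frames; a crude stand-in for motion compensation."""
    return [[(1 - t) * pa + t * pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def synthesize(
    maps: List[Frame],
    generator: Callable[[Frame], Frame],
    factor: int,
    stride: int,
) -> List[Frame]:
    """Run `generator` on downsampled key frames only, then fill the gaps."""
    keys = keyframe_indices(len(maps), stride)
    # The (hypothetical) generator is called once per key frame, on reduced input.
    generated = {i: generator(downsample(maps[i], factor)) for i in keys}
    out: List[Frame] = []
    for lo, hi in zip(keys, keys[1:]):
        for i in range(lo, hi):
            out.append(blend(generated[lo], generated[hi], (i - lo) / (hi - lo)))
    out.append(generated[keys[-1]])
    return out
```

In this toy setup, a stride of 4 over 9 frames calls the generator 3 times instead of 9, and a downsampling factor of 2 gives each call a quarter of the input pixels. These figures only illustrate where the savings come from; they are not the 28–35× reduction reported in the abstract.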

Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence

Self-citation rate

0.00%

Publication count