{"title":"OSV: One Step is Enough for High-Quality Image to Video Generation","authors":"Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang","doi":"arxiv-2409.11367","DOIUrl":null,"url":null,"abstract":"Video diffusion models have shown great potential in generating high-quality\nvideos, making them an increasingly popular focus. However, their inherent\niterative nature leads to substantial computational and time costs. While\nefforts have been made to accelerate video diffusion by reducing inference\nsteps (through techniques like consistency distillation) and GAN training\n(these approaches often fall short in either performance or training\nstability). In this work, we introduce a two-stage training framework that\neffectively combines consistency distillation with GAN training to address\nthese challenges. Additionally, we propose a novel video discriminator design,\nwhich eliminates the need for decoding the video latents and improves the final\nperformance. Our model is capable of producing high-quality videos in merely\none-step, with the flexibility to perform multi-step refinement for further\nperformance enhancement. Our quantitative evaluation on the OpenWebVid-1M\nbenchmark shows that our model significantly outperforms existing methods.\nNotably, our 1-step performance(FVD 171.15) exceeds the 8-step performance of\nthe consistency distillation based method, AnimateLCM (FVD 184.79), and\napproaches the 25-step performance of advanced Stable Video Diffusion (FVD\n156.94).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11367","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. Efforts have been made to accelerate video diffusion by reducing inference steps, through techniques such as consistency distillation and GAN training, but these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need to decode the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
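
The abstract describes the core mechanism only at a high level: a one-step student trained with a combination of consistency distillation and GAN training, where the discriminator operates directly on video latents rather than decoded frames. The PyTorch sketch below illustrates that idea under stated assumptions; the module names, tensor shapes, loss form, and loss weighting are illustrative guesses, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the idea in the abstract:
# a one-step student is trained with (a) a consistency-distillation loss
# against a multi-step teacher and (b) an adversarial loss from a
# discriminator applied directly to video latents, so no VAE decoding
# is needed. All names, shapes, and weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVideoDiscriminator(nn.Module):
    """Scores real vs. generated video latents of shape (B, T, C, H, W) without decoding."""
    def __init__(self, latent_channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(latent_channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden * 2, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(hidden * 2, 1),
        )

    def forward(self, latents):
        # Conv3d expects (B, C, T, H, W), so move the time axis.
        x = latents.permute(0, 2, 1, 3, 4)
        return self.net(x)

def training_step(student, teacher, disc, noisy_latents, t, real_latents):
    """One illustrative optimization step combining distillation and GAN losses."""
    # The one-step student predicts clean latents directly from noisy latents.
    pred = student(noisy_latents, t)

    # Consistency-distillation target from a frozen multi-step teacher.
    with torch.no_grad():
        target = teacher(noisy_latents, t)
    distill_loss = F.mse_loss(pred, target)

    # Generator-side adversarial loss computed on latents (no VAE decode).
    adv_loss = -disc(pred).mean()

    # Discriminator hinge loss on real vs. generated latents.
    d_loss = (F.relu(1.0 - disc(real_latents)).mean()
              + F.relu(1.0 + disc(pred.detach())).mean())

    g_loss = distill_loss + 0.1 * adv_loss  # weighting is an assumption
    return g_loss, d_loss
```

In a two-stage schedule of the kind the abstract mentions, one would typically optimize the distillation term first and enable the adversarial term in a second stage; the exact schedule and weights used by OSV are not specified in the abstract.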