A novel framework for diverse video generation from a single video using frame-conditioned denoising diffusion probabilistic model and ConvNeXt-V2

IF 4.2 · Zone 3 (Computer Science) · JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-02-01 DOI:10.1016/j.imavis.2025.105422

Ayushi Verma, Tapas Badal, Abhay Bansal

Image and Vision Computing, Volume 154, Article 105422. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885625000101

Citations: 0

Abstract

The Denoising Diffusion Probabilistic Model (DDPM) has significantly advanced video generation and synthesis, but it relies on extensive datasets for training. This study presents a novel method for generating videos from a single video through a frame-conditioned DDPM. Additionally, incorporating the ConvNeXt-V2 model significantly boosts the framework’s feature extraction, improving video generation performance. To address data scarcity in video generation, the proposed framework exploits a single video’s intrinsic complexities and temporal dynamics to generate diverse and realistic sequences. The model’s ability to generalize motion is demonstrated through quantitative assessments in which it is trained on segments of the original video and evaluated on previously unseen frames. Integrating Global Response Normalization (GRN) and Sigmoid-Weighted Linear Unit (SiLU) activation functions within the DDPM framework enhances the quality of the generated video. The proposed model markedly outperforms the SinFusion model across key quality metrics, achieving a lower Fréchet Video Distance (FVD) score of 106.52, a lower Learned Perceptual Image Patch Similarity (LPIPS) score of 0.085, a higher Structural Similarity Index Measure (SSIM) score of 0.089, and a higher Nearest-Neighbor-Field (NNF) based diversity (NNFDIV) score of 0.44. Furthermore, the model achieves a Peak Signal-to-Noise Ratio (PSNR) score of 23.95, demonstrating its strength in preserving image integrity despite noise. The integration of GRN and SiLU significantly enhances content synthesis, while ConvNeXt-V2 boosts feature extraction, amplifying the model’s efficacy.
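The abstract names Global Response Normalization and SiLU but does not give their formulas. As a rough illustration only, the sketch below follows the GRN formulation commonly associated with ConvNeXt V2 (per-channel global L2 norm, divisive normalization by the channel mean, then a learnable scale and shift with a residual connection) and the standard definition of SiLU as x · sigmoid(x). The function names, the `[H][W][C]` nested-list layout, and the `eps` value are assumptions made for this example; this is not the authors' implementation.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x), written to avoid overflow in exp()."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe because x < 0
    return x * e / (1.0 + e)


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization for one feature map x[h][w][c].

    gamma and beta are per-channel lists (learnable parameters in a real
    network; plain numbers in this sketch).
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])

    # 1) Global aggregation: L2 norm of each channel over all spatial positions.
    g = [
        math.sqrt(sum(x[i][j][c] ** 2 for i in range(height) for j in range(width)))
        for c in range(channels)
    ]

    # 2) Divisive normalization: each channel norm relative to the mean norm.
    mean_g = sum(g) / channels
    n = [g_c / (mean_g + eps) for g_c in g]

    # 3) Calibration: scale and shift, plus a residual connection to the input.
    return [
        [
            [gamma[c] * x[i][j][c] * n[c] + beta[c] + x[i][j][c] for c in range(channels)]
            for j in range(width)
        ]
        for i in range(height)
    ]


if __name__ == "__main__":
    # Hypothetical 2x2 feature map with 2 channels.
    feature_map = [[[1.0, 0.5], [2.0, 0.5]], [[0.0, 0.5], [2.0, 0.5]]]
    normalized = grn(feature_map, gamma=[1.0, 1.0], beta=[0.0, 0.0])
    activated = [[[silu(v) for v in px] for px in row] for row in normalized]
    print(activated)
```

With `gamma` and `beta` set to zero the layer reduces to the identity, which is why such a block can be inserted into an existing network without changing its initial behaviour.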




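Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months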

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.