Title: Single image and video generation using a receptive diffusion model with convolutional spatiotemporal blocks
Authors: Yingli Hou, Wei Zhang, Zhiliang Zhu, Hai Yu
Journal: Applied Soft Computing, volume 181, Article 113509 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.2)
Publication date: 2025-06-30
Publication type: Journal Article
DOI: 10.1016/j.asoc.2025.113509
URL: https://www.sciencedirect.com/science/article/pii/S1568494625008208
Citation count: 0
Abstract
Generating images from a single natural image or video has attracted significant attention because of its broad applications. However, existing methods that train on a single input image or video face two key limitations. First, GAN-based approaches rely on multiple models trained at progressively increasing scales, which often leads to error accumulation and artifacts in the generated results. Second, while diffusion models offer superior quality and diversity, they require extensive training time for a single input and are limited to generation tasks, without the ability to edit existing images or videos. To address these challenges, we propose a Unified Diffusion Model for Single Image/Video Training, named Union, which achieves a balanced trade-off between computational efficiency and visual quality. Specifically, we introduce: (1) a unified model trained at a single scale, avoiding the error accumulation seen in multi-scale models; and (2) a novel Receptive DDPM framework with convolutional spatiotemporal blocks (CS-Block) that learns the patch distribution of a natural image rather than simply replicating the image. The CS-Block uses ConvNeXt and spatiotemporal attention mechanisms to capture local and global relationships in the temporal and frequency domains, enabling efficient adaptation to the patch-level receptive field of natural images and videos. Extensive experiments across image and video tasks demonstrate that Union outperforms other methods, achieving the best LPIPS score on the public Places50 dataset and excelling in high-resolution video generation, providing an optimal balance between computational cost and performance. The training and generated images/videos are available at: https://github.com/hylneu/union.git.
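The abstract's notion of a patch-level receptive field can be made concrete with a small calculation. The sketch below is not taken from the paper: it applies the standard receptive-field formula for a stack of convolutions, and the layer configurations (kernel size, stride, dilation) are hypothetical values chosen only for illustration, not the actual CS-Block settings.

```python
from typing import Iterable, Tuple

# Hypothetical layer description: (kernel_size, stride, dilation).
Layer = Tuple[int, int, int]


def receptive_field(layers: Iterable[Layer]) -> int:
    """Return the receptive field (in input pixels, along one axis)
    of a stack of convolution layers applied in order."""
    rf = 1    # a single output unit initially sees one input pixel
    jump = 1  # distance in input pixels between adjacent units at this depth
    for kernel, stride, dilation in layers:
        # Each layer widens the field by (kernel - 1) * dilation steps,
        # where one step spans `jump` input pixels.
        rf += (kernel - 1) * dilation * jump
        # Striding spreads later units further apart in input space.
        jump *= stride
    return rf


if __name__ == "__main__":
    # Four stride-1 7x7 convolutions (an assumed, illustrative stack).
    print(receptive_field([(7, 1, 1)] * 4))
    # A stride-2 3x3 convolution followed by a 7x7 convolution.
    print(receptive_field([(3, 2, 1), (7, 1, 1)]))
```

Under these assumptions, keeping the stack shallow and unstrided bounds how much of the image each output location can see, which is one way a network can be pushed toward modeling patch statistics rather than memorizing the whole image.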
About the journal:
Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real world complexities.
Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the web site will continuously be updated with new articles and the publication time will be short.