Single image and video generation using a receptive diffusion model with convolutional spatiotemporal blocks

IF 7.2; Zone 1 (Computer Science); Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing, Pub Date: 2025-06-30, DOI: 10.1016/j.asoc.2025.113509

Yingli Hou, Wei Zhang, Zhiliang Zhu, Hai Yu

Applied Soft Computing, Volume 181, Article 113509. Publisher page: https://www.sciencedirect.com/science/article/pii/S1568494625008208

Citations: 0

Abstract


The generation of images from a single natural image or video has garnered significant attention because of its broad applications. However, existing methods that train on a single input image or video face two key limitations. First, GAN-based approaches rely on multiple models trained at progressively increasing scales, which often leads to error accumulation and artifacts in the generated results. Second, while diffusion models offer superior quality and diversity, they require extensive training time for a single input and are limited to generation tasks, without the ability to edit existing images or videos. To address these challenges, we propose a Unified Diffusion Model for Single Image/Video Training, named Union, which achieves a balanced trade-off between computational efficiency and visual quality. Specifically, we introduce: (1) a unified model trained at a single scale, avoiding the error accumulation seen in multi-scale models; and (2) a novel Receptive DDPM framework with convolutional spatiotemporal blocks (CS-Block) that learns the patch distribution of a natural image rather than simply replicating the image. The CS-Block uses ConvNeXt and spatiotemporal attention mechanisms to capture local and global relationships in the temporal and frequency domains, enabling efficient adaptation to the patch-level receptive field of natural images and videos. Extensive experiments across image and video tasks demonstrate that Union outperforms other methods, achieving the best LPIPS score on the public Places50 dataset and excelling in high-resolution video generation, providing an optimal balance between computational cost and performance. The training and generated images/videos are available at: https://github.com/hylneu/union.git.
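The idea of a patch-level receptive field can be illustrated with a small calculation. The sketch below is not the Union architecture: the layer list is a hypothetical stack chosen only for illustration, and the function name `receptive_field` is an assumption. It applies the standard receptive-field recurrence for stacked convolutions to show how a shallow stack sees only a small patch of the input, which is the property that pushes a model toward learning patch statistics rather than reproducing the whole image. Attention layers are not covered by this recurrence, since they can relate distant positions directly.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int, int]]) -> int:
    """Receptive field, in input pixels along one axis, of stacked conv layers.

    Each layer is given as (kernel_size, stride, dilation).
    """
    rf = 1    # one output unit initially covers a single input pixel
    jump = 1  # distance, in input pixels, between adjacent units at this depth
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stack (an assumption, not taken from the paper):
# one 7x7 convolution, the kernel size used by ConvNeXt blocks,
# followed by two 3x3 convolutions, all with stride 1 and no dilation.
hypothetical_stack = [(7, 1, 1), (3, 1, 1), (3, 1, 1)]

rf = receptive_field(hypothetical_stack)
image_side = 256  # assumed side length of the single training image

# Each output position depends only on an rf x rf patch of the input.
print(f"receptive field: {rf} px, covers whole image: {rf >= image_side}")
```

Under these assumptions, every output position depends on a patch much smaller than the image, so the network cannot simply memorize the global layout. Adding strided layers or more depth enlarges the patch, which is the knob a design of this kind would tune.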


Source journal

Applied Soft Computing, Engineering & Technology - Computer Science: Interdisciplinary Applications

CiteScore

15.80

Self-citation rate

6.90%

Articles published

874

Review time

10.9 months

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest-quality research in the application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website is continuously updated with new articles and the publication time is short.