ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions

IF 9.3 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2025-03-25 DOI:10.1007/s11263-025-02413-7

Yipeng Zhang, Xin Wang, Hong Chen, Chenyang Qin, Yibo Hao, Hong Mei, Wenwu Zhu

{"title":"ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions","authors":"Yipeng Zhang, Xin Wang, Hong Chen, Chenyang Qin, Yibo Hao, Hong Mei, Wenwu Zhu","doi":"10.1007/s11263-025-02413-7","DOIUrl":null,"url":null,"abstract":"<p>With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject as well as the process of scene transformations; ii) they can only generate videos with limited frames, failing to capture the whole transformation process. To address these issues, we propose the model named ScenarioDiff, which is able to generate longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scenes of each frame. To effectively present the process of scene transformation, we introduce mixed frequency controlnet, which utilizes several frames of the generated videos to extend them to long videos chunk by chunk in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"35 1","pages":""},"PeriodicalIF":9.3000,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-025-02413-7","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject as well as the process of scene transformations; ii) they can only generate videos with limited frames, failing to capture the whole transformation process. To address these issues, we propose the model named ScenarioDiff, which is able to generate longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scenes of each frame. To effectively present the process of scene transformation, we introduce mixed frequency controlnet, which utilizes several frames of the generated videos to extend them to long videos chunk by chunk in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.

查看原文本刊更多论文

场景diff：场景条件动态转换的文本到视频生成

随着扩散模型的发展，文本到视频的生成最近受到了极大的关注，并取得了显著的成功。然而，现有的文本视频生成方法存在以下缺陷：i) 无法控制主体的运动轨迹以及场景变换过程；ii) 只能生成有限帧的视频，无法捕捉整个变换过程。为了解决这些问题，我们提出了名为 ScenarioDiff 的模型，它能够生成带有场景变换的较长视频。具体来说，我们采用空间布局融合器来控制主体和每帧场景的位置。为了有效呈现场景转换过程，我们引入了混频控制网，利用生成视频的若干帧，以自动回归的方式将其逐块扩展为长视频。此外，为了确保不同视频块之间的一致性，我们在推理过程中提出了一种跨块调度机制。实验结果表明，我们的方法在生成动态场景转换视频时非常有效。我们的项目页面位于 https://scenariodiff2024.github.io/。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.