{"title":"PreciseVideo: a dual-process framework for zero-shot text-to-video generation with quantitative content control","authors":"Lizhi Dang , Ting Liang , Huixin Zhang , Ruihao Zhang , Yingping Hong","doi":"10.1016/j.inffus.2025.104030","DOIUrl":null,"url":null,"abstract":"<div><div>Text-to-video (T2V) generation has recently gained significant attention, yet existing methods primarily focus on global temporal consistency and lack fine-grained, element-wise control over background dynamics and character behaviors. We propose <strong>PreciseVideo</strong>, a zero-shot T2V framework that enables controllable video synthesis at both the background and foreground levels. PreciseVideo introduces a dual-stage generation paradigm, separating background and character synthesis, and incorporates three novel modules: the <em>Region-Independent Noise Modulator</em> for quantifiable, region-wise temporal dynamics, <em>Sparse Fusion Attention</em> for structured cross-frame coherence, and <em>Optimal-Reference-Frame Attention</em> to preserve full-body character identity and appearance. This modular design ensures high-fidelity, temporally coherent, and behaviorally consistent video generation, even in complex multi-character scenarios. Extensive experiments demonstrate that PreciseVideo excels in element-wise controllability, character quantity accuracy, and multi-character scene synthesis compared with both zero-shot and training-based baselines. Ablation studies validate the effectiveness of each proposed module, while additional evaluations on scene-to-character and inter-character occlusions highlight the framework’s robustness and flexibility. Collectively, our results establish PreciseVideo as a highly controllable and scalable T2V approach, filling a critical gap in fine-grained, element-wise controllable video generation and setting a foundation for future advances in complex scene synthesis. Our code and related experimental results are available at <span><span>https://github.com/GG-Bond2023/PreciseVideo</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104030"},"PeriodicalIF":15.5000,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525010929","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/12/5 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Text-to-video (T2V) generation has recently gained significant attention, yet existing methods primarily focus on global temporal consistency and lack fine-grained, element-wise control over background dynamics and character behaviors. We propose PreciseVideo, a zero-shot T2V framework that enables controllable video synthesis at both the background and foreground levels. PreciseVideo introduces a dual-stage generation paradigm, separating background and character synthesis, and incorporates three novel modules: the Region-Independent Noise Modulator for quantifiable, region-wise temporal dynamics, Sparse Fusion Attention for structured cross-frame coherence, and Optimal-Reference-Frame Attention to preserve full-body character identity and appearance. This modular design ensures high-fidelity, temporally coherent, and behaviorally consistent video generation, even in complex multi-character scenarios. Extensive experiments demonstrate that PreciseVideo excels in element-wise controllability, character quantity accuracy, and multi-character scene synthesis compared with both zero-shot and training-based baselines. Ablation studies validate the effectiveness of each proposed module, while additional evaluations on scene-to-character and inter-character occlusions highlight the framework’s robustness and flexibility. Collectively, our results establish PreciseVideo as a highly controllable and scalable T2V approach, filling a critical gap in fine-grained, element-wise controllable video generation and setting a foundation for future advances in complex scene synthesis. Our code and related experimental results are available at https://github.com/GG-Bond2023/PreciseVideo.
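The abstract does not describe how the Region-Independent Noise Modulator is implemented, but its stated goal of "quantifiable, region-wise temporal dynamics" can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below is an assumption made for illustration only: the function name `region_independent_noise`, the `region_masks` and `motion_levels` parameters, and the blending scheme (a variance-preserving mix of a shared base noise and frame-wise noise, a common device in zero-shot T2V initialization) are not taken from the paper.

```python
import torch

def region_independent_noise(
    num_frames: int,
    latent_shape: tuple,         # (C, H, W) latent dimensions
    region_masks: torch.Tensor,  # (R, H, W) binary masks; assumed to partition the grid
    motion_levels: torch.Tensor, # (R,) in [0, 1]; 0 = static region, 1 = fully dynamic
) -> torch.Tensor:
    """Hypothetical per-region noise initialization: each region blends a
    noise map shared across frames (temporal coherence) with independent
    per-frame noise (dynamics), weighted by its own motion level."""
    C, H, W = latent_shape
    base = torch.randn(1, C, H, W).expand(num_frames, -1, -1, -1)  # shared across frames
    frame_wise = torch.randn(num_frames, C, H, W)                  # independent per frame
    noise = torch.zeros(num_frames, C, H, W)
    for mask, m in zip(region_masks, motion_levels):
        # sqrt(1 - m^2) * base + m * frame_wise keeps unit variance
        # when mixing two independent standard Gaussians
        blended = (1 - m**2).sqrt() * base + m * frame_wise
        noise = noise + mask * blended  # (H, W) mask broadcasts over (F, C, H, W)
    return noise
```

A quantitative knob then falls out naturally: a region's motion level directly scales how much its latent noise varies across frames, e.g.

```python
masks = torch.zeros(2, 64, 64)
masks[0, :, :32] = 1.0  # left half: near-static background
masks[1, :, 32:] = 1.0  # right half: dynamic character region
noise = region_independent_noise(16, (4, 64, 64), masks, torch.tensor([0.1, 0.8]))
```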
Journal Introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, with a focus on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.