{"title":"PreciseVideo: a dual-process framework for zero-shot text-to-video generation with quantitative content control","authors":"Lizhi Dang , Ting Liang , Huixin Zhang , Ruihao Zhang , Yingping Hong","doi":"10.1016/j.inffus.2025.104030","DOIUrl":null,"url":null,"abstract":"<div><div>Text-to-video (T2V) generation has recently gained significant attention, yet existing methods primarily focus on global temporal consistency and lack fine-grained, element-wise control over background dynamics and character behaviors. We propose <strong>PreciseVideo</strong>, a zero-shot T2V framework that enables controllable video synthesis at both the background and foreground levels. PreciseVideo introduces a dual-stage generation paradigm, separating background and character synthesis, and incorporates three novel modules: the <em>Region-Independent Noise Modulator</em> for quantifiable, region-wise temporal dynamics, <em>Sparse Fusion Attention</em> for structured cross-frame coherence, and <em>Optimal-Reference-Frame Attention</em> to preserve full-body character identity and appearance. This modular design ensures high-fidelity, temporally coherent, and behaviorally consistent video generation, even in complex multi-character scenarios. Extensive experiments demonstrate that PreciseVideo excels in element-wise controllability, character quantity accuracy, and multi-character scene synthesis compared with both zero-shot and training-based baselines. Ablation studies validate the effectiveness of each proposed module, while additional evaluations on scene-to-character and inter-character occlusions highlight the framework’s robustness and flexibility. Collectively, our results establish PreciseVideo as a highly controllable and scalable T2V approach, filling a critical gap in fine-grained, element-wise controllable video generation and setting a foundation for future advances in complex scene synthesis. Our code and related experimental results are available at <span><span>https://github.com/GG-Bond2023/PreciseVideo</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104030"},"PeriodicalIF":15.5000,"publicationDate":"2026-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525010929","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/12/5 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Text-to-video (T2V) generation has recently gained significant attention, yet existing methods primarily focus on global temporal consistency and lack fine-grained, element-wise control over background dynamics and character behaviors. We propose PreciseVideo, a zero-shot T2V framework that enables controllable video synthesis at both the background and foreground levels. PreciseVideo introduces a dual-stage generation paradigm, separating background and character synthesis, and incorporates three novel modules: the Region-Independent Noise Modulator for quantifiable, region-wise temporal dynamics, Sparse Fusion Attention for structured cross-frame coherence, and Optimal-Reference-Frame Attention to preserve full-body character identity and appearance. This modular design ensures high-fidelity, temporally coherent, and behaviorally consistent video generation, even in complex multi-character scenarios. Extensive experiments demonstrate that PreciseVideo excels in element-wise controllability, character quantity accuracy, and multi-character scene synthesis compared with both zero-shot and training-based baselines. Ablation studies validate the effectiveness of each proposed module, while additional evaluations on scene-to-character and inter-character occlusions highlight the framework’s robustness and flexibility. Collectively, our results establish PreciseVideo as a highly controllable and scalable T2V approach, filling a critical gap in fine-grained, element-wise controllable video generation and setting a foundation for future advances in complex scene synthesis. Our code and related experimental results are available at https://github.com/GG-Bond2023/PreciseVideo.
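The abstract does not describe how the Region-Independent Noise Modulator is implemented, but its stated goal of "quantifiable, region-wise temporal dynamics" can be illustrated with a minimal, hypothetical PyTorch sketch. Everything below is an assumption made for illustration only: the function name `region_independent_noise`, the `region_masks` and `motion_levels` parameters, and the blending scheme (a variance-preserving mix of a shared base noise and frame-wise noise, a common device in zero-shot T2V initialization) are not taken from the paper.

```python
import torch

def region_independent_noise(
    num_frames: int,
    latent_shape: tuple,         # (C, H, W) latent dimensions
    region_masks: torch.Tensor,  # (R, H, W) binary masks; assumed to partition the grid
    motion_levels: torch.Tensor, # (R,) in [0, 1]; 0 = static region, 1 = fully dynamic
) -> torch.Tensor:
    """Hypothetical per-region noise initialization: each region blends a
    noise map shared across frames (temporal coherence) with independent
    per-frame noise (dynamics), weighted by its own motion level."""
    C, H, W = latent_shape
    base = torch.randn(1, C, H, W).expand(num_frames, -1, -1, -1)  # shared across frames
    frame_wise = torch.randn(num_frames, C, H, W)                  # independent per frame
    noise = torch.zeros(num_frames, C, H, W)
    for mask, m in zip(region_masks, motion_levels):
        # sqrt(1 - m^2) * base + m * frame_wise keeps unit variance
        # when mixing two independent standard Gaussians
        blended = (1 - m**2).sqrt() * base + m * frame_wise
        noise = noise + mask * blended  # (H, W) mask broadcasts over (F, C, H, W)
    return noise
```

A quantitative knob then falls out naturally: a region's motion level directly scales how much its latent noise varies across frames, e.g.

```python
masks = torch.zeros(2, 64, 64)
masks[0, :, :32] = 1.0  # left half: near-static background
masks[1, :, 32:] = 1.0  # right half: dynamic character region
noise = region_independent_noise(16, (4, 64, 64), masks, torch.tensor([0.1, 0.8]))
```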
Journal Introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for research and development in this field, with a focus on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.