PreciseVideo: a dual-process framework for zero-shot text-to-video generation with quantitative content control

Impact Factor: 15.5 · CAS Zone 1 (Computer Science) · JCR Q1, Computer Science, Artificial Intelligence
Information Fusion · Pub Date: 2026-05-01 · Epub Date: 2025-12-05 · DOI: 10.1016/j.inffus.2025.104030
Lizhi Dang, Ting Liang, Huixin Zhang, Ruihao Zhang, Yingping Hong
Citations: 0

Abstract

Text-to-video (T2V) generation has recently gained significant attention, yet existing methods primarily focus on global temporal consistency and lack fine-grained, element-wise control over background dynamics and character behaviors. We propose PreciseVideo, a zero-shot T2V framework that enables controllable video synthesis at both the background and foreground levels. PreciseVideo introduces a dual-stage generation paradigm, separating background and character synthesis, and incorporates three novel modules: the Region-Independent Noise Modulator for quantifiable, region-wise temporal dynamics, Sparse Fusion Attention for structured cross-frame coherence, and Optimal-Reference-Frame Attention to preserve full-body character identity and appearance. This modular design ensures high-fidelity, temporally coherent, and behaviorally consistent video generation, even in complex multi-character scenarios. Extensive experiments demonstrate that PreciseVideo excels in element-wise controllability, character quantity accuracy, and multi-character scene synthesis compared with both zero-shot and training-based baselines. Ablation studies validate the effectiveness of each proposed module, while additional evaluations on scene-to-character and inter-character occlusions highlight the framework’s robustness and flexibility. Collectively, our results establish PreciseVideo as a highly controllable and scalable T2V approach, filling a critical gap in fine-grained, element-wise controllable video generation and setting a foundation for future advances in complex scene synthesis. Our code and related experimental results are available at https://github.com/GG-Bond2023/PreciseVideo.
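The abstract describes a Region-Independent Noise Modulator that assigns each spatial region its own, quantifiable level of temporal dynamics. The paper's actual implementation is not reproduced here; the following is only a minimal conceptual sketch, under the assumption that region-wise control can be expressed by blending a frame-shared noise component with a per-frame component at a region-specific mixing scale (all function and parameter names are hypothetical, not from the paper).

```python
# Hypothetical sketch of region-wise temporal noise modulation.
# Idea: each region mixes a frame-shared ("static") noise map with
# per-frame ("dynamic") noise; a larger scale means more motion there.
import numpy as np

def region_noise(frames, height, width, region_masks, region_scales, seed=0):
    """Build per-frame noise where each region gets its own temporal
    dynamics scale in [0, 1]: 0 -> frozen, 1 -> fully independent frames."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((height, width))            # shared across frames
    deltas = rng.standard_normal((frames, height, width))  # varies per frame
    noise = np.zeros((frames, height, width))
    for mask, scale in zip(region_masks, region_scales):
        # variance-preserving blend of static and dynamic components
        region = np.sqrt(1.0 - scale**2) * base[None] + scale * deltas
        noise += mask[None] * region
    return noise

# Example: a mostly static background (scale 0.1) next to a
# highly dynamic foreground (scale 0.9).
h, w = 8, 8
bg = np.zeros((h, w)); bg[:, :4] = 1.0
fg = 1.0 - bg
n = region_noise(frames=4, height=h, width=w,
                 region_masks=[bg, fg], region_scales=[0.1, 0.9])
# Frame-to-frame change is much larger inside the dynamic region.
bg_var = np.var(n[1] - n[0], where=bg.astype(bool))
fg_var = np.var(n[1] - n[0], where=fg.astype(bool))
```

The variance-preserving blend keeps each region's marginal noise distribution approximately standard normal while exposing a single scalar per region that quantifies its temporal change, which is one plausible way to read "quantifiable, region-wise temporal dynamics".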
Source journal: Information Fusion (Engineering & Technology — Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles per year: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.