DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion.

IF 18.6

IEEE transactions on pattern analysis and machine intelligence Pub Date : 2025-08-19 DOI:10.1109/TPAMI.2025.3600149

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, Jian Yin

{"title":"DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion.","authors":"Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, Jian Yin","doi":"10.1109/TPAMI.2025.3600149","DOIUrl":null,"url":null,"abstract":"<p><p>Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework by leveraging the LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating consistent multi-subject across the images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene's subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with consistent multi-subject. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. MMSA module ensures detailed appearance consistency with reference images, while MMCA captures key attributes of subjects from their reference text to ensure semantic consistency. Both modules employ masking mechanisms to restrict each scene's subjects to referencing the multimodal information of the corresponding subject, effectively preventing blending between multiple subjects. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2025.3600149","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework by leveraging the LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating consistent multi-subject across the images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene's subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with consistent multi-subject. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. MMSA module ensures detailed appearance consistency with reference images, while MMCA captures key attributes of subjects from their reference text to ensure semantic consistency. Both modules employ masking mechanisms to restrict each scene's subjects to referencing the multimodal information of the corresponding subject, effectively preventing blending between multiple subjects. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations. Please visit our project homepage at https://dream-xyz.github.io/dreamstory.

查看原文本刊更多论文

DreamStory：由法学硕士引导的多主题一致扩散的开放域故事可视化。

故事可视化旨在创造与文本叙事相对应的视觉上引人注目的图像或视频。尽管最近在扩散模型方面取得了进展，产生了有希望的结果，但现有的方法仍然很难仅仅基于一个故事来创建一个主题一致的框架的连贯序列。为此，我们提出了DreamStory，这是一个利用llm和一个新的多主题一致扩散模型的自动开放域故事可视化框架。DreamStory由(1)作为故事导演的法学硕士和(2)创新的多主题一致扩散模型（MSD）组成，用于在图像中生成一致的多主题。首先，DreamStory使用LLM为与故事一致的主题和场景生成描述性提示，为后续主题一致的生成注释每个场景的主题。其次，DreamStory利用这些详细的主题描述来创建主题肖像，这些肖像和相应的文本信息作为多模态锚（引导）。最后，MSD使用这些多模态锚生成具有一致多主体的故事场景。具体来说，MSD包括屏蔽相互自注意（MMSA）和屏蔽相互交叉注意（MMCA）模块。MMSA模块确保与参考图像的详细外观一致性，而MMCA模块从参考文本中捕获主题的关键属性以确保语义一致性。两个模块都采用掩蔽机制，限制每个场景的主体只能引用对应主体的多模态信息，有效防止多个主体之间的混合。为了验证我们的方法并促进故事可视化的进展，我们建立了一个基准DS-500，它可以评估故事可视化框架的整体性能、主题识别准确性和生成模型的一致性。大量的实验验证了DreamStory在主观和客观评价方面的有效性。请访问我们的项目主页https://dream-xyz.github.io/dreamstory。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE transactions on pattern analysis and machine intelligence

自引率

0.00%

发文量