PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

ArXiv Pub Date : 2024-03-06 DOI:10.1145/3613904.3642803

Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, Tianyi Zhang



Abstract

Recent progress in generative AI has significantly advanced text-to-image generation. The state-of-the-art text-to-image model, Stable Diffusion, can now synthesize high-quality images with a strong sense of aesthetics. Crafting text prompts that align with both the model's interpretation and the user's intent has thus become crucial. However, prompting remains challenging for novice users because of the complexity of the Stable Diffusion model and the non-trivial effort required to iteratively edit and refine text prompts. To address these challenges, we propose PromptCharm, a mixed-initiative system that facilitates text-to-image creation through multi-modal prompt engineering and refinement. To assist novice users in prompting, PromptCharm first automatically refines and optimizes the user's initial prompt. It also lets the user explore and select different image styles within a large database. To help users refine their prompts and images effectively, PromptCharm renders model explanations by visualizing the model's attention values. If the user notices any unsatisfactory areas in the generated images, they can further refine the images through model attention adjustment or image inpainting within PromptCharm's rich feedback loop. To evaluate the effectiveness and usability of PromptCharm, we conducted a controlled user study with 12 participants and an exploratory user study with another 12 participants. These two studies show that participants using PromptCharm created images of higher quality that were better aligned with their expectations than participants using two variants of PromptCharm that lacked interaction or visualization support.
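The abstract mentions visualizing per-token attention values and letting the user adjust them. As a rough illustration of that idea only, the sketch below scales one prompt token's attention weight and renormalizes the distribution. It is not PromptCharm's implementation: the token list, the raw scores, and the helper names `softmax` and `reweight_attention` are all hypothetical, and a real diffusion model applies attention per layer and per image region rather than to a single list of numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(weights, token_index, factor):
    """Scale one token's weight by `factor`, then renormalize to sum to 1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if factor < 0:
        raise ValueError("factor must be non-negative")
    scaled = list(weights)
    scaled[token_index] *= factor
    total = sum(scaled)
    if total == 0:
        raise ValueError("all attention weights are zero after scaling")
    return [w / total for w in scaled]


# Hypothetical prompt tokens and raw attention scores (made-up numbers).
tokens = ["a", "castle", "at", "sunset"]
scores = [0.1, 2.0, 0.1, 1.0]

weights = softmax(scores)
# Emphasize "sunset" by doubling its weight before renormalizing.
boosted = reweight_attention(weights, tokens.index("sunset"), 2.0)

for token, before, after in zip(tokens, weights, boosted):
    print(f"{token:>8}: {before:.3f} -> {after:.3f}")
```

Because the weights are renormalized, boosting one token necessarily lowers the share of the others, which is one simple way to think about the trade-off a user makes when emphasizing part of a prompt.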


