EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance

arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2024-09-12. DOI: arxiv-2409.08091

Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu

Citations: 0

Abstract

EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance

Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advances in diffusion-model-based methods, existing approaches still struggle to balance identity preservation with text prompt alignment. In this study, we investigate this issue in depth and identify key insights for achieving effective identity preservation while maintaining a strong balance between the two. Our key findings are: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image encoder based on the UNet architecture of the pretrained Stable Diffusion model to ensure high-quality identity transfer, and a process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.
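The abstract does not say how the guidance stages are decoupled or how the layout is refined, so the following is only a toy sketch of one possible reading, not the authors' implementation. It assumes a hard switch: early denoising steps are guided by the text prompt alone (to form a layout), later steps by the subject features alone, and the whole pass may be repeated on its own output. The names `Step`, `decoupled_schedule`, `refine`, and the parameter `layout_fraction` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Step:
    index: int
    guidance: str  # "text" (layout stage) or "subject" (appearance stage)


def decoupled_schedule(num_steps: int, layout_fraction: float) -> List[Step]:
    """Assign every denoising step to exactly one guidance source.

    Assumption (not from the paper): the first `layout_fraction` of the
    steps use only text guidance, the remaining steps only subject guidance.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= layout_fraction <= 1.0:
        raise ValueError("layout_fraction must lie in [0, 1]")
    switch = round(num_steps * layout_fraction)
    return [
        Step(index=i, guidance="text" if i < switch else "subject")
        for i in range(num_steps)
    ]


def refine(initial: T, run_pass: Callable[[T], T], num_rounds: int) -> T:
    """Feed the result of each pass back in as the next starting layout."""
    current = initial
    for _ in range(num_rounds):
        current = run_pass(current)
    return current
```

For instance, `decoupled_schedule(8, 0.25)` marks the first two steps as text-guided and the remaining six as subject-guided. In a real pipeline, `run_pass` would wrap a full diffusion sampling run. Here it is any callable, which keeps the sketch runnable without a model.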
