OmniGen: Unified Image Generation

arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI:arxiv-2409.11340

Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu

{"title":"OmniGen: Unified Image Generation","authors":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu","doi":"arxiv-2409.11340","DOIUrl":null,"url":null,"abstract":"In this work, we introduce OmniGen, a new diffusion model for unified image\ngeneration. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\nno longer requires additional modules such as ControlNet or IP-Adapter to\nprocess diverse control conditions. OmniGenis characterized by the following\nfeatures: 1) Unification: OmniGen not only demonstrates text-to-image\ngeneration capabilities but also inherently supports other downstream tasks,\nsuch as image editing, subject-driven generation, and visual-conditional\ngeneration. Additionally, OmniGen can handle classical computer vision tasks by\ntransforming them into image generation tasks, such as edge detection and human\npose recognition. 2) Simplicity: The architecture of OmniGen is highly\nsimplified, eliminating the need for additional text encoders. Moreover, it is\nmore user-friendly compared to existing diffusion models, enabling complex\ntasks to be accomplished through instructions without the need for extra\npreprocessing steps (e.g., human pose estimation), thereby significantly\nsimplifying the workflow of image generation. 3) Knowledge Transfer: Through\nlearning in a unified format, OmniGen effectively transfers knowledge across\ndifferent tasks, manages unseen tasks and domains, and exhibits novel\ncapabilities. We also explore the model's reasoning capabilities and potential\napplications of chain-of-thought mechanism. This work represents the first\nattempt at a general-purpose image generation model, and there remain several\nunresolved issues. We will open-source the related resources at\nhttps://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"205 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11340","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGenis characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.

查看原文本刊更多论文

OmniGen：统一图像生成

在这项工作中，我们介绍了用于统一图像生成的新型扩散模型 OmniGen。与流行的扩散模型（如稳定扩散）不同，OmniGen 不再需要控制网或 IP 适配器等附加模块来处理不同的控制条件。OmniGen 具有以下特点：1) 统一性：OmniGen 不仅展示了文本到图像的生成功能，而且还固有地支持其他下游任务，如图像编辑、主题驱动生成和视觉条件生成。此外，OmniGen 还能处理经典的计算机视觉任务，将其转换为图像生成任务，如边缘检测和人脸识别。2) 简单性：OmniGen 的架构高度简化，无需额外的文本编码器。此外，与现有的扩散模型相比，OmniGen 对用户更加友好，通过指令即可完成全部任务，无需额外的预处理步骤（如人体姿态估计），从而大大简化了图像生成的工作流程。3) 知识转移：通过以统一格式进行学习，OmniGen 可有效地在不同任务间转移知识，管理未见过的任务和领域，并展现出新颖的能力。我们还探索了模型的推理能力和思维链机制的潜在应用。这项工作是对通用图像生成模型的首次尝试，目前仍有几个问题尚未解决。我们将在 https://github.com/VectorSpaceLab/OmniGen 上开源相关资源，以促进该领域的进步。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

arXiv - CS - Computer Vision and Pattern Recognition

自引率

0.00%

发文量