{"title":"OmniGen:统一图像生成","authors":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu","doi":"arxiv-2409.11340","DOIUrl":null,"url":null,"abstract":"In this work, we introduce OmniGen, a new diffusion model for unified image\ngeneration. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\nno longer requires additional modules such as ControlNet or IP-Adapter to\nprocess diverse control conditions. OmniGenis characterized by the following\nfeatures: 1) Unification: OmniGen not only demonstrates text-to-image\ngeneration capabilities but also inherently supports other downstream tasks,\nsuch as image editing, subject-driven generation, and visual-conditional\ngeneration. Additionally, OmniGen can handle classical computer vision tasks by\ntransforming them into image generation tasks, such as edge detection and human\npose recognition. 2) Simplicity: The architecture of OmniGen is highly\nsimplified, eliminating the need for additional text encoders. Moreover, it is\nmore user-friendly compared to existing diffusion models, enabling complex\ntasks to be accomplished through instructions without the need for extra\npreprocessing steps (e.g., human pose estimation), thereby significantly\nsimplifying the workflow of image generation. 3) Knowledge Transfer: Through\nlearning in a unified format, OmniGen effectively transfers knowledge across\ndifferent tasks, manages unseen tasks and domains, and exhibits novel\ncapabilities. We also explore the model's reasoning capabilities and potential\napplications of chain-of-thought mechanism. This work represents the first\nattempt at a general-purpose image generation model, and there remain several\nunresolved issues. We will open-source the related resources at\nhttps://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"OmniGen: Unified Image Generation\",\"authors\":\"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu\",\"doi\":\"arxiv-2409.11340\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we introduce OmniGen, a new diffusion model for unified image\\ngeneration. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\\nno longer requires additional modules such as ControlNet or IP-Adapter to\\nprocess diverse control conditions. OmniGenis characterized by the following\\nfeatures: 1) Unification: OmniGen not only demonstrates text-to-image\\ngeneration capabilities but also inherently supports other downstream tasks,\\nsuch as image editing, subject-driven generation, and visual-conditional\\ngeneration. Additionally, OmniGen can handle classical computer vision tasks by\\ntransforming them into image generation tasks, such as edge detection and human\\npose recognition. 2) Simplicity: The architecture of OmniGen is highly\\nsimplified, eliminating the need for additional text encoders. 
Moreover, it is\\nmore user-friendly compared to existing diffusion models, enabling complex\\ntasks to be accomplished through instructions without the need for extra\\npreprocessing steps (e.g., human pose estimation), thereby significantly\\nsimplifying the workflow of image generation. 3) Knowledge Transfer: Through\\nlearning in a unified format, OmniGen effectively transfers knowledge across\\ndifferent tasks, manages unseen tasks and domains, and exhibits novel\\ncapabilities. We also explore the model's reasoning capabilities and potential\\napplications of chain-of-thought mechanism. This work represents the first\\nattempt at a general-purpose image generation model, and there remain several\\nunresolved issues. We will open-source the related resources at\\nhttps://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11340\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11340","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In this work, we introduce OmniGen, a new diffusion model for unified image
generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen
does not require additional modules such as ControlNet or IP-Adapter to
process diverse control conditions. OmniGen is characterized by the following
features: 1) Unification: OmniGen not only demonstrates text-to-image
generation capabilities but also inherently supports other downstream tasks,
such as image editing, subject-driven generation, and visual-conditional
generation. Additionally, OmniGen can handle classical computer vision tasks,
such as edge detection and human pose recognition, by transforming them into
image generation tasks. 2) Simplicity: The architecture of OmniGen is highly
simplified, eliminating the need for additional text encoders. Moreover, it is
more user-friendly than existing diffusion models, enabling complex
tasks to be accomplished through instructions without the need for extra
preprocessing steps (e.g., human pose estimation), thereby significantly
simplifying the workflow of image generation. 3) Knowledge Transfer: Through
learning in a unified format, OmniGen effectively transfers knowledge across
different tasks, manages unseen tasks and domains, and exhibits novel
capabilities. We also explore the model's reasoning capabilities and potential
applications of the chain-of-thought mechanism. This work represents the first
attempt at a general-purpose image generation model, and there remain several
unresolved issues. We will open-source the related resources at
https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
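
To make the instruction-driven workflow concrete, here is a minimal usage sketch in the style of the open-sourced repository: one pipeline serving both text-to-image generation and image editing, with the condition image referenced directly in the instruction rather than through a ControlNet- or IP-Adapter-style module. The class name `OmniGenPipeline`, the checkpoint id `Shitao/OmniGen-v1`, the placeholder syntax `<img><|image_1|></img>`, and the keyword arguments are assumptions drawn from the project README and may not match the released API exactly.

```python
# Hypothetical usage sketch for the open-sourced OmniGen pipeline.
# All names (OmniGenPipeline, "Shitao/OmniGen-v1", <img><|image_1|></img>,
# img_guidance_scale) are assumptions based on the project README.
from OmniGen import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1")

# Text-to-image: an ordinary prompt, as in other diffusion models.
images = pipe(
    prompt="A photo of a red fox sitting in a snowy forest",
    height=1024,
    width=1024,
    guidance_scale=2.5,
)
images[0].save("t2i.png")

# Image editing with the same pipeline: the instruction references the
# input image through a placeholder token, so no extra module (ControlNet,
# IP-Adapter) or preprocessing step (e.g., pose estimation) is needed.
images = pipe(
    prompt="The woman in <img><|image_1|></img> waves her hand happily",
    input_images=["person.png"],
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
)
images[0].save("edit.png")
```

The same calling convention would, in principle, cover the classical vision tasks mentioned above: an instruction such as "Detect the human pose of the person in <img><|image_1|></img>" frames pose recognition as generating a skeleton image.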