arXiv - CS - Computer Vision and Pattern Recognition: Latest Papers

Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11227
Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya
{"title":"Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark","authors":"Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya","doi":"arxiv-2409.11227","DOIUrl":"https://doi.org/arxiv-2409.11227","url":null,"abstract":"Learning with limited labelled data is a challenging problem in various\u0000applications, including remote sensing. Few-shot semantic segmentation is one\u0000approach that can encourage deep learning models to learn from few labelled\u0000examples for novel classes not seen during the training. The generalized\u0000few-shot segmentation setting has an additional challenge which encourages\u0000models not only to adapt to the novel classes but also to maintain strong\u0000performance on the training base classes. While previous datasets and\u0000benchmarks discussed the few-shot segmentation setting in remote sensing, we\u0000are the first to propose a generalized few-shot segmentation benchmark for\u0000remote sensing. The generalized setting is more realistic and challenging,\u0000which necessitates exploring it within the remote sensing context. We release\u0000the dataset augmenting OpenEarthMap with additional classes labelled for the\u0000generalized few-shot evaluation setting. The dataset is released during the\u0000OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU\u0000workshop in conjunction with CVPR 2024. In this work, we summarize the dataset\u0000and challenge details in addition to providing the benchmark results on the two\u0000phases of the challenge for the validation and test sets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
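The generalized setting scores a model on base and novel classes together rather than on novel classes alone. The sketch below is not the official challenge evaluation code; the class splits and the unweighted averaging of base and novel mIoU are assumptions for illustration. It shows how per-class IoU on a predicted label map can be summarized separately for base and novel classes.

```python
# Minimal sketch (not the official challenge code): per-class IoU on integer
# label maps, summarized separately for base and novel classes as required by
# the generalized few-shot setting. The class splits below are placeholders,
# not the OpenEarthMap challenge splits.
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU per class over integer label maps of identical shape (NaN if class absent)."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

def generalized_fss_summary(pred, gt, base_classes, novel_classes):
    """Mean IoU over base classes, novel classes, and their unweighted average."""
    ious = per_class_iou(pred, gt, num_classes=max(base_classes + novel_classes) + 1)
    base_miou = np.nanmean(ious[base_classes])
    novel_miou = np.nanmean(ious[novel_classes])
    return {"base_mIoU": base_miou, "novel_mIoU": novel_miou,
            "avg_mIoU": 0.5 * (base_miou + novel_miou)}

# Toy usage with random maps; classes 0-3 act as "base", 4-5 as "novel".
rng = np.random.default_rng(0)
pred = rng.integers(0, 6, size=(128, 128))
gt = rng.integers(0, 6, size=(128, 128))
print(generalized_fss_summary(pred, gt, base_classes=[0, 1, 2, 3], novel_classes=[4, 5]))
```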
NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.12165
Sree Rama Vamsidhar S, Rama Krishna Gorthi
{"title":"NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning","authors":"Sree Rama Vamsidhar S, Rama Krishna Gorthi","doi":"arxiv-2409.12165","DOIUrl":"https://doi.org/arxiv-2409.12165","url":null,"abstract":"The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods\u0000employ Deep Learning (DL) techniques using a large amount of image data. The\u0000primary limitation to extending the existing SotA ISR works for real-world\u0000instances is their computational and time complexities. In this paper, contrary\u0000to the existing methods, we present a novel and computationally efficient ISR\u0000algorithm that is independent of the image dataset to learn the ISR task. The\u0000proposed algorithm reformulates the ISR task from generating the Super-Resolved\u0000(SR) images to computing the inverse of the kernels that span the degradation\u0000space. We introduce Deep Identity Learning, exploiting the identity relation\u0000between the degradation and inverse degradation models. The proposed approach\u0000neither relies on the ISR dataset nor on a single input low-resolution (LR)\u0000image (like the self-supervised method i.e. ZSSR) to model the ISR task. Hence\u0000we term our model as Null-Shot Super-Resolution Using Deep Identity Learning\u0000(NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources,\u0000at least by an order of 10, and demonstrates a competitive performance on\u0000benchmark ISR datasets. Another salient aspect of our proposition is that the\u0000NSSR-DIL framework detours retraining the model and remains the same for\u0000varying scale factors like X2, X3, and X4. This makes our highly efficient ISR\u0000model more suitable for real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
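The core idea in the abstract, learning an inverse of the degradation kernel through an identity relation without any image data, can be illustrated with a toy optimization: given a known blur kernel, fit an inverse filter so that their composition approximates a Dirac delta (identity) filter. This is a hedged sketch of that identity objective only; the kernel sizes, the single learnable filter standing in for the paper's linear model, and the plain MSE loss are assumptions, not the NSSR-DIL implementation.

```python
# Toy "deep identity learning" objective: make (blur kernel) composed with
# (learned inverse filter) approximate an identity (delta) filter.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A known degradation kernel: a 7x7 Gaussian-like blur, shape (1, 1, 7, 7).
xs = torch.arange(7, dtype=torch.float32) - 3.0
g = torch.exp(-(xs ** 2) / 4.0)
blur = g[:, None] * g[None, :]
blur = (blur / blur.sum()).view(1, 1, 7, 7)

# Learnable "inverse" filter with larger support than the blur.
inv_kernel = torch.zeros(1, 1, 15, 15, requires_grad=True)

# Identity target: a Dirac delta on the full composed support (7 + 2*14 - 15 + 1 = 21).
target = torch.zeros(1, 1, 21, 21)
target[0, 0, 10, 10] = 1.0

opt = torch.optim.Adam([inv_kernel], lr=1e-2)
for step in range(2000):
    # Compose blur and inverse filter (cross-correlation is fine here: the filter is learned).
    composed = F.conv2d(blur, inv_kernel, padding=14)   # (1, 1, 21, 21)
    loss = F.mse_loss(composed, target)                 # identity (delta) objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print("identity loss after training:", loss.item())
```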
fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11315
Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, Yanwei Fu
{"title":"fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction","authors":"Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, Yanwei Fu","doi":"arxiv-2409.11315","DOIUrl":"https://doi.org/arxiv-2409.11315","url":null,"abstract":"Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI)\u0000data, introduced as Recon3DMind in our conference work, is of significant\u0000interest to both cognitive neuroscience and computer vision. To advance this\u0000task, we present the fMRI-3D dataset, which includes data from 15 participants\u0000and showcases a total of 4768 3D objects. The dataset comprises two components:\u0000fMRI-Shape, previously introduced and accessible at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse,\u0000proposed in this paper and available at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse\u0000includes data from 5 subjects, 4 of whom are also part of the Core set in\u0000fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories,\u0000all accompanied by text captions. This significantly enhances the diversity and\u0000potential applications of the dataset. Additionally, we propose MinD-3D, a\u0000novel framework designed to decode 3D visual information from fMRI signals. The\u0000framework first extracts and aggregates features from fMRI data using a\u0000neuro-fusion encoder, then employs a feature-bridge diffusion model to generate\u0000visual features, and finally reconstructs the 3D object using a generative\u0000transformer decoder. We establish new benchmarks by designing metrics at both\u0000semantic and structural levels to evaluate model performance. Furthermore, we\u0000assess our model's effectiveness in an Out-of-Distribution setting and analyze\u0000the attribution of the extracted features and the visual ROIs in fMRI signals.\u0000Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with\u0000high semantic and spatial accuracy but also deepens our understanding of how\u0000human brain processes 3D visual information. Project page at:\u0000https://jianxgao.github.io/MinD-3D.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
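The abstract describes MinD-3D as a three-stage pipeline: a neuro-fusion encoder, a feature-bridge diffusion model, and a generative transformer decoder. The skeleton below mirrors only that structure; every dimension, layer choice, the mean-pooling fusion, and the deterministic projection standing in for the diffusion bridge are assumptions for illustration, not the authors' architecture.

```python
# Structural sketch only of the three-stage decoding pipeline named in the abstract.
import torch
import torch.nn as nn

class NeuroFusionEncoder(nn.Module):
    """Aggregates multi-frame fMRI features into one latent (stand-in MLP + mean pooling)."""
    def __init__(self, voxel_dim=4096, latent_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(voxel_dim, 1024), nn.GELU(),
                                 nn.Linear(1024, latent_dim))
    def forward(self, fmri):                 # fmri: (B, T, voxel_dim)
        return self.mlp(fmri).mean(dim=1)    # fuse frames by mean pooling

class FeatureBridge(nn.Module):
    """Maps fMRI latents toward a visual feature space (a diffusion model in the paper;
    a single deterministic projection here)."""
    def __init__(self, latent_dim=512, visual_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(latent_dim, visual_dim), nn.GELU(),
                                  nn.Linear(visual_dim, visual_dim))
    def forward(self, z):
        return self.proj(z)

class Generative3DDecoder(nn.Module):
    """Transformer decoder emitting a token sequence for the 3D representation."""
    def __init__(self, visual_dim=768, num_tokens=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, visual_dim))
        layer = nn.TransformerDecoderLayer(d_model=visual_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
    def forward(self, visual_feat):                           # visual_feat: (B, visual_dim)
        memory = visual_feat.unsqueeze(1)                     # (B, 1, D)
        tgt = self.queries.unsqueeze(0).expand(visual_feat.size(0), -1, -1)
        return self.decoder(tgt, memory)                      # (B, num_tokens, D)

fmri = torch.randn(2, 8, 4096)               # toy batch: 2 samples, 8 fMRI frames each
tokens = Generative3DDecoder()(FeatureBridge()(NeuroFusionEncoder()(fmri)))
print(tokens.shape)                          # torch.Size([2, 256, 768])
```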
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11406
Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau
{"title":"Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion","authors":"Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau","doi":"arxiv-2409.11406","DOIUrl":"https://doi.org/arxiv-2409.11406","url":null,"abstract":"In 3D modeling, designers often use an existing 3D model as a reference to\u0000create new ones. This practice has inspired the development of Phidias, a novel\u0000generative model that uses diffusion for reference-augmented 3D generation.\u0000Given an image, our method leverages a retrieved or user-provided 3D reference\u0000model to guide the generation process, thereby enhancing the generation\u0000quality, generalization ability, and controllability. Our model integrates\u0000three key components: 1) meta-ControlNet that dynamically modulates the\u0000conditioning strength, 2) dynamic reference routing that mitigates misalignment\u0000between the input image and 3D reference, and 3) self-reference augmentations\u0000that enable self-supervised training with a progressive curriculum.\u0000Collectively, these designs result in a clear improvement over existing\u0000methods. Phidias establishes a unified framework for 3D generation using text,\u0000image, and 3D conditions with versatile applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"191 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
GS-Net: Generalizable Plug-and-Play 3D Gaussian Splatting Module
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11307
Yichen Zhang, Zihan Wang, Jiali Han, Peilin Li, Jiaxun Zhang, Jianqiang Wang, Lei He, Keqiang Li
{"title":"GS-Net: Generalizable Plug-and-Play 3D Gaussian Splatting Module","authors":"Yichen Zhang, Zihan Wang, Jiali Han, Peilin Li, Jiaxun Zhang, Jianqiang Wang, Lei He, Keqiang Li","doi":"arxiv-2409.11307","DOIUrl":"https://doi.org/arxiv-2409.11307","url":null,"abstract":"3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based\u0000representations and volumetric rendering techniques, enabling real-time,\u0000high-quality rendering. However, 3DGS models typically overfit to single-scene\u0000training and are highly sensitive to the initialization of Gaussian ellipsoids,\u0000heuristically derived from Structure from Motion (SfM) point clouds, which\u0000limits both generalization and practicality. To address these limitations, we\u0000propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies\u0000Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure\u0000representation. To the best of our knowledge, GS-Net is the first plug-and-play\u00003DGS module with cross-scene generalization capabilities. Additionally, we\u0000introduce the CARLA-NVS dataset, which incorporates additional camera\u0000viewpoints to thoroughly evaluate reconstruction and rendering quality.\u0000Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR\u0000improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel\u0000viewpoints, confirming the method's effectiveness and robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
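The reported gains (2.08 dB and 1.86 dB) are measured in PSNR. For reference, here is a minimal implementation of the standard PSNR definition for images scaled to [0, 1]; this is generic evaluation code, not code from GS-Net.

```python
# Standard PSNR (dB) between a rendered image and ground truth.
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of identical shape."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy usage. For context, a +2.08 dB gain corresponds to the MSE dropping by a
# factor of about 10**0.208 ~= 1.6.
gt = np.random.rand(256, 256, 3)
render_a = np.clip(gt + 0.05 * np.random.randn(*gt.shape), 0, 1)
render_b = np.clip(gt + 0.04 * np.random.randn(*gt.shape), 0, 1)
print(psnr(render_a, gt), psnr(render_b, gt))
```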
RenderWorld: World Model with Self-Supervised 3D Label
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11356
Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma
{"title":"RenderWorld: World Model with Self-Supervised 3D Label","authors":"Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma","doi":"arxiv-2409.11356","DOIUrl":"https://doi.org/arxiv-2409.11356","url":null,"abstract":"End-to-end autonomous driving with vision-only is not only more\u0000cost-effective compared to LiDAR-vision fusion but also more reliable than\u0000traditional methods. To achieve a economical and robust purely visual\u0000autonomous driving system, we propose RenderWorld, a vision-only end-to-end\u0000autonomous driving framework, which generates 3D occupancy labels using a\u0000self-supervised gaussian-based Img2Occ Module, then encodes the labels by\u0000AM-VAE, and uses world model for forecasting and planning. RenderWorld employs\u0000Gaussian Splatting to represent 3D scenes and render 2D images greatly improves\u0000segmentation accuracy and reduces GPU memory consumption compared with\u0000NeRF-based methods. By applying AM-VAE to encode air and non-air separately,\u0000RenderWorld achieves more fine-grained scene element representation, leading to\u0000state-of-the-art performance in both 4D occupancy forecasting and motion\u0000planning from autoregressive world model.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
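The "encode air and non-air separately" idea can be sketched as splitting an occupancy label grid into a binary air mask and a one-hot semantic volume for occupied voxels, each passed through its own encoder. The split and the two small 3D convolutions below are illustrative assumptions only and are not the AM-VAE from the paper.

```python
# Illustrative split-and-encode sketch for occupancy grids (label 0 treated as air).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparateOccupancyEncoder(nn.Module):
    def __init__(self, num_classes=16, dim=32):
        super().__init__()
        self.num_classes = num_classes
        # One branch for the binary air/non-air mask, one for non-air semantics.
        self.air_enc = nn.Conv3d(1, dim, kernel_size=3, stride=2, padding=1)
        self.sem_enc = nn.Conv3d(num_classes, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, occ_labels):                      # occ_labels: (B, X, Y, Z) int, 0 = air
        air_mask = (occ_labels == 0).float().unsqueeze(1)                # (B, 1, X, Y, Z)
        sem = F.one_hot(occ_labels, num_classes=self.num_classes)        # (B, X, Y, Z, C)
        sem = sem.permute(0, 4, 1, 2, 3).float() * (1.0 - air_mask)      # zero out air voxels
        return self.air_enc(air_mask), self.sem_enc(sem)

occ = torch.randint(0, 16, (1, 32, 32, 32))             # toy occupancy grid with 16 labels
air_feat, sem_feat = SeparateOccupancyEncoder()(occ)
print(air_feat.shape, sem_feat.shape)                   # each: torch.Size([1, 32, 16, 16, 16])
```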
MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11316
Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh
{"title":"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":"https://doi.org/arxiv-2409.11316","url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\u0000in query images with only a handful of annotated examples. However, many\u0000previous state-of-the-art methods either have to discard intricate local\u0000semantic features or suffer from high computational complexity. To address\u0000these challenges, we propose a new Few-shot Semantic Segmentation framework\u0000based on the transformer architecture. Our approach introduces the spatial\u0000transformer decoder and the contextual mask generation module to improve the\u0000relational understanding between support and query images. Moreover, we\u0000introduce a multi-scale decoder to refine the segmentation mask by\u0000incorporating features from different resolutions in a hierarchical manner.\u0000Additionally, our approach integrates global features from intermediate encoder\u0000stages to improve contextual understanding, while maintaining a lightweight\u0000structure to reduce complexity. This balance between performance and efficiency\u0000enables our method to achieve state-of-the-art results on benchmark datasets\u0000such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\u0000Notably, our model with only 1.5 million parameters demonstrates competitive\u0000performance while overcoming limitations of existing methodologies.\u0000https://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
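A minimal sketch of the hierarchical multi-scale fusion described in the abstract: coarse feature maps are upsampled and merged with finer ones before a final prediction head. The channel widths and the 1x1/3x3 convolutions are assumptions; the authors' actual decoder is available in the repository linked above.

```python
# Coarse-to-fine multi-scale decoder sketch (not the MSDNet implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDecoder(nn.Module):
    def __init__(self, channels=(256, 128, 64), num_classes=2, width=64):
        super().__init__()
        # Project every scale to a common width; refine after each coarse-to-fine merge.
        self.proj = nn.ModuleList([nn.Conv2d(c, width, 1) for c in channels])
        self.refine = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1)
                                     for _ in channels[1:]])
        self.head = nn.Conv2d(width, num_classes, 1)

    def forward(self, feats):             # feats: list of (B, C_i, H_i, W_i), coarse -> fine
        x = self.proj[0](feats[0])
        for i in range(1, len(feats)):
            x = F.interpolate(x, size=feats[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            x = self.refine[i - 1](x + self.proj[i](feats[i]))
        return self.head(x)               # logits at the finest resolution

# Toy features from three encoder stages, coarse to fine.
feats = [torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32), torch.randn(1, 64, 64, 64)]
print(MultiScaleDecoder()(feats).shape)   # torch.Size([1, 2, 64, 64])
```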
Uncertainty and Prediction Quality Estimation for Semantic Segmentation via Graph Neural Networks
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11373
Edgar Heinert, Stephan Tilgner, Timo Palm, Matthias Rottmann
{"title":"Uncertainty and Prediction Quality Estimation for Semantic Segmentation via Graph Neural Networks","authors":"Edgar Heinert, Stephan Tilgner, Timo Palm, Matthias Rottmann","doi":"arxiv-2409.11373","DOIUrl":"https://doi.org/arxiv-2409.11373","url":null,"abstract":"When employing deep neural networks (DNNs) for semantic segmentation in\u0000safety-critical applications like automotive perception or medical imaging, it\u0000is important to estimate their performance at runtime, e.g. via uncertainty\u0000estimates or prediction quality estimates. Previous works mostly performed\u0000uncertainty estimation on pixel-level. In a line of research, a\u0000connected-component-wise (segment-wise) perspective was taken, approaching\u0000uncertainty estimation on an object-level by performing so-called meta\u0000classification and regression to estimate uncertainty and prediction quality,\u0000respectively. In those works, each predicted segment is considered individually\u0000to estimate its uncertainty or prediction quality. However, the neighboring\u0000segments may provide additional hints on whether a given predicted segment is\u0000of high quality, which we study in the present work. On the basis of\u0000uncertainty indicating metrics on segment-level, we use graph neural networks\u0000(GNNs) to model the relationship of a given segment's quality as a function of\u0000the given segment's metrics as well as those of its neighboring segments. We\u0000compare different GNN architectures and achieve a notable performance\u0000improvement.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
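The segment-graph idea can be sketched in plain PyTorch: each predicted segment is a node carrying hand-crafted uncertainty metrics, edges connect spatially neighboring segments, and a small graph network regresses per-segment quality (meta regression). The two-layer mean-aggregation network, the feature set, and the toy adjacency below are assumptions, not the GNN architectures compared in the paper.

```python
# Segment graph + simple mean-aggregation graph network for quality regression.
import torch
import torch.nn as nn

class SegmentGraphRegressor(nn.Module):
    def __init__(self, in_dim=8, hidden=32):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        # adj: (N, N) 0/1 adjacency between segments; add self-loops, row-normalize.
        a = adj + torch.eye(adj.size(0))
        a = a / a.sum(dim=1, keepdim=True)
        h = torch.relu(self.lin1(a @ x))                    # aggregate neighbor metrics, then transform
        return torch.sigmoid(self.lin2(a @ h)).squeeze(-1)  # predicted quality in [0, 1]

# Toy graph: 5 predicted segments, 8 uncertainty metrics each (e.g. mean softmax
# entropy, segment size, ...); edges connect spatially adjacent segments.
x = torch.randn(5, 8)
adj = torch.tensor([[0, 1, 1, 0, 0],
                    [1, 0, 1, 0, 0],
                    [1, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]], dtype=torch.float32)
print(SegmentGraphRegressor()(x, adj))   # one quality estimate per segment
```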
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11355
Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
{"title":"Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think","authors":"Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe","doi":"arxiv-2409.11355","DOIUrl":"https://doi.org/arxiv-2409.11355","url":null,"abstract":"Recent work showed that large diffusion models can be reused as highly\u0000precise monocular depth estimators by casting depth estimation as an\u0000image-conditional image generation task. While the proposed model achieved\u0000state-of-the-art results, high computational demands due to multi-step\u0000inference limited its use in many scenarios. In this paper, we show that the\u0000perceived inefficiency was caused by a flaw in the inference pipeline that has\u0000so far gone unnoticed. The fixed model performs comparably to the best\u0000previously reported configuration while being more than 200$times$ faster. To\u0000optimize for downstream task performance, we perform end-to-end fine-tuning on\u0000top of the single-step model with task-specific losses and get a deterministic\u0000model that outperforms all other diffusion-based depth and normal estimation\u0000models on common zero-shot benchmarks. We surprisingly find that this\u0000fine-tuning protocol also works directly on Stable Diffusion and achieves\u0000comparable performance to current state-of-the-art diffusion-based depth and\u0000normal estimation models, calling into question some of the conclusions drawn\u0000from prior works.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
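The paper fine-tunes the single-step model end-to-end with task-specific losses. As one plausible example of such a loss (an assumption, not necessarily the loss used in the paper), the commonly used affine-invariant depth loss aligns the prediction to the ground truth with a least-squares scale and shift before penalizing the residual.

```python
# Affine-invariant depth loss: solve for per-sample scale/shift, then take L1.
import torch

def affine_invariant_depth_loss(pred: torch.Tensor, gt: torch.Tensor,
                                mask: torch.Tensor) -> torch.Tensor:
    """pred, gt: (B, H, W) depth maps; mask: (B, H, W) bool of valid pixels."""
    losses = []
    for p, g, m in zip(pred, gt, mask):
        p, g = p[m], g[m]                                     # valid pixels only
        # Solve min_{s, t} ||s * p + t - g||^2 in closed form via least squares.
        A = torch.stack([p, torch.ones_like(p)], dim=1)       # (N, 2)
        sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution  # (2, 1): [scale, shift]
        aligned = A @ sol                                      # (N, 1)
        losses.append((aligned.squeeze(1) - g).abs().mean())
    return torch.stack(losses).mean()

# Toy usage: ground truth is an affine transform of the prediction plus noise,
# so the loss should be small after alignment.
pred = torch.rand(2, 64, 64)
gt = 3.0 * pred + 0.5 + 0.01 * torch.randn(2, 64, 64)
mask = torch.ones(2, 64, 64, dtype=torch.bool)
print(affine_invariant_depth_loss(pred, gt, mask))
```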
OmniGen: Unified Image Generation
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2024-09-17 DOI: arxiv-2409.11340
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu
{"title":"OmniGen: Unified Image Generation","authors":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu","doi":"arxiv-2409.11340","DOIUrl":"https://doi.org/arxiv-2409.11340","url":null,"abstract":"In this work, we introduce OmniGen, a new diffusion model for unified image\u0000generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\u0000no longer requires additional modules such as ControlNet or IP-Adapter to\u0000process diverse control conditions. OmniGenis characterized by the following\u0000features: 1) Unification: OmniGen not only demonstrates text-to-image\u0000generation capabilities but also inherently supports other downstream tasks,\u0000such as image editing, subject-driven generation, and visual-conditional\u0000generation. Additionally, OmniGen can handle classical computer vision tasks by\u0000transforming them into image generation tasks, such as edge detection and human\u0000pose recognition. 2) Simplicity: The architecture of OmniGen is highly\u0000simplified, eliminating the need for additional text encoders. Moreover, it is\u0000more user-friendly compared to existing diffusion models, enabling complex\u0000tasks to be accomplished through instructions without the need for extra\u0000preprocessing steps (e.g., human pose estimation), thereby significantly\u0000simplifying the workflow of image generation. 3) Knowledge Transfer: Through\u0000learning in a unified format, OmniGen effectively transfers knowledge across\u0000different tasks, manages unseen tasks and domains, and exhibits novel\u0000capabilities. We also explore the model's reasoning capabilities and potential\u0000applications of chain-of-thought mechanism. This work represents the first\u0000attempt at a general-purpose image generation model, and there remain several\u0000unresolved issues. We will open-source the related resources at\u0000https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"205 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0