Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe

arXiv:2409.11355 [cs.CV], 2024-09-17
Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster.
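The abstract does not spell out the flaw, but a well-known single-step pitfall in diffusers-style DDIM pipelines illustrates the kind of bug at stake: with the default "leading" timestep spacing, a one-step schedule starts denoising near t = 0, so the network is asked to treat pure Gaussian noise as an almost-clean latent. Below is a minimal, runnable sketch of that behavior (assuming the diffusers library; an illustration of the failure mode, not necessarily the paper's exact fix):

```python
from diffusers import DDIMScheduler

# Default "leading" spacing: a single inference step lands near t = 0,
# so a pure-noise latent is denoised as if it were almost clean.
leading = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="leading")
leading.set_timesteps(num_inference_steps=1)
print(leading.timesteps)   # tensor([0]) (shifted by any steps_offset)

# "Trailing" spacing: the single step starts at the final timestep t = 999,
# which matches the noise level of the randomly initialized latent.
trailing = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="trailing")
trailing.set_timesteps(num_inference_steps=1)
print(trailing.timesteps)  # tensor([999])
```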
To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior work.
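The abstract leaves the task-specific losses unspecified; for depth, work in this line commonly supervises predictions up to an affine ambiguity with a scale-and-shift-invariant loss. A minimal PyTorch sketch of such a loss follows (an illustrative stand-in, not necessarily the paper's formulation; `pred` and `target` are hypothetical per-sample depth maps):

```python
import torch

def ssi_depth_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift-invariant depth loss (illustrative sketch).

    Per sample, least-squares align `pred` to `target` with a scalar scale s
    and shift b, then penalize the aligned residual. Shapes: (B, H, W).
    """
    p = pred.flatten(1)    # (B, N)
    t = target.flatten(1)  # (B, N)
    p_mean = p.mean(dim=1, keepdim=True)
    t_mean = t.mean(dim=1, keepdim=True)
    # Closed-form minimizers of ||s * p + b - t||^2:
    #   s = cov(p, t) / var(p),  b = mean(t) - s * mean(p)
    cov = ((p - p_mean) * (t - t_mean)).mean(dim=1, keepdim=True)
    var = ((p - p_mean) ** 2).mean(dim=1, keepdim=True).clamp_min(1e-6)
    s = cov / var
    b = t_mean - s * p_mean
    return (s * p + b - t).abs().mean()

# Usage: loss = ssi_depth_loss(predicted_depth, gt_depth); loss.backward()
```

Because the alignment has a closed form, the loss stays differentiable end-to-end, which is what makes this style of fine-tuning straightforward on top of a single-step model.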