Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
{"title":"微调图像条件扩散模型比想象中更容易","authors":"Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe","doi":"arxiv-2409.11355","DOIUrl":null,"url":null,"abstract":"Recent work showed that large diffusion models can be reused as highly\nprecise monocular depth estimators by casting depth estimation as an\nimage-conditional image generation task. While the proposed model achieved\nstate-of-the-art results, high computational demands due to multi-step\ninference limited its use in many scenarios. In this paper, we show that the\nperceived inefficiency was caused by a flaw in the inference pipeline that has\nso far gone unnoticed. The fixed model performs comparably to the best\npreviously reported configuration while being more than 200$\\times$ faster. To\noptimize for downstream task performance, we perform end-to-end fine-tuning on\ntop of the single-step model with task-specific losses and get a deterministic\nmodel that outperforms all other diffusion-based depth and normal estimation\nmodels on common zero-shot benchmarks. We surprisingly find that this\nfine-tuning protocol also works directly on Stable Diffusion and achieves\ncomparable performance to current state-of-the-art diffusion-based depth and\nnormal estimation models, calling into question some of the conclusions drawn\nfrom prior works.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think\",\"authors\":\"Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe\",\"doi\":\"arxiv-2409.11355\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work showed that large diffusion models can be reused as highly\\nprecise monocular depth estimators by casting depth estimation as an\\nimage-conditional image generation task. While the proposed model achieved\\nstate-of-the-art results, high computational demands due to multi-step\\ninference limited its use in many scenarios. In this paper, we show that the\\nperceived inefficiency was caused by a flaw in the inference pipeline that has\\nso far gone unnoticed. The fixed model performs comparably to the best\\npreviously reported configuration while being more than 200$\\\\times$ faster. To\\noptimize for downstream task performance, we perform end-to-end fine-tuning on\\ntop of the single-step model with task-specific losses and get a deterministic\\nmodel that outperforms all other diffusion-based depth and normal estimation\\nmodels on common zero-shot benchmarks. 
We surprisingly find that this\\nfine-tuning protocol also works directly on Stable Diffusion and achieves\\ncomparable performance to current state-of-the-art diffusion-based depth and\\nnormal estimation models, calling into question some of the conclusions drawn\\nfrom prior works.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11355\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11355","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200$\times$ faster.
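The abstract does not spell out the flaw itself; the paper traces it to how denoising timesteps are spaced at low step counts, so that single-step inference queries the model at the wrong noise level. Below is a minimal sketch of that mismatch using diffusers' DDIMScheduler (assuming a diffusers release that supports the timestep_spacing option); it illustrates the spacing issue rather than reproducing the authors' code.

```python
from diffusers import DDIMScheduler

T = 1000  # training timesteps of Stable Diffusion

# "leading" spacing: with a single inference step, the scheduler queries the
# model at t=0, i.e. it is told its input is almost clean, even though the
# input is actually pure Gaussian noise.
leading = DDIMScheduler(num_train_timesteps=T, timestep_spacing="leading")
leading.set_timesteps(num_inference_steps=1)
print(leading.timesteps)  # tensor([0])

# "trailing" spacing: the single step is taken at t=T-1, which matches the
# pure-noise input, so one-step prediction becomes meaningful.
trailing = DDIMScheduler(num_train_timesteps=T, timestep_spacing="trailing")
trailing.set_timesteps(num_inference_steps=1)
print(trailing.timesteps)  # tensor([999])
```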
To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses, obtaining a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks.
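The abstract leaves "task-specific losses" unspecified; for depth, a common choice in this line of work is an affine-invariant (scale-and-shift-invariant) regression loss in the style of MiDaS. The PyTorch sketch below shows such a loss as a plausible stand-in, not the authors' exact objective.

```python
import torch

def affine_invariant_depth_loss(pred: torch.Tensor,
                                gt: torch.Tensor,
                                mask: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift-invariant L1 depth loss (MiDaS-style), an illustrative
    stand-in for the task-specific loss named in the abstract.

    pred, gt: (B, H, W) predicted / ground-truth depth.
    mask:     (B, H, W) bool, marking pixels with valid ground truth.
    """
    losses = []
    for p, g, m in zip(pred, gt, mask):
        p, g = p[m], g[m]
        # Closed-form least squares for scale s and shift t with s*p + t ≈ g.
        n = p.numel()
        sp, sg = p.sum(), g.sum()
        spp, spg = (p * p).sum(), (p * g).sum()
        s = (n * spg - sp * sg) / (n * spp - sp * sp)
        t = (sg - s * sp) / n
        losses.append((s * p + t - g).abs().mean())
    return torch.stack(losses).mean()
```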
Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.