Enhancing Diffusion Models with 3D Perspective Geometry Constraints

ACM Transactions on Graphics (TOG). Pub Date: 2023-12-01. DOI: 10.1145/3618389

Rishi Upadhyay, Howard Zhang, Yunhao Ba, Ethan Yang, Blake Gella, Sicheng Jiang, Alex Wong, A. Kadambi



Abstract

While perspective is a well-studied topic in art, it is generally taken for granted in images. However, for the recent wave of high-quality image synthesis methods such as latent diffusion models, perspective accuracy is not an explicit requirement. Since these methods are capable of outputting a wide gamut of possible images, it is difficult for these synthesized images to adhere to the principles of linear perspective. We introduce a novel geometric constraint in the training process of generative models to enforce perspective accuracy. We show that outputs of models trained with this constraint both appear more realistic and improve performance of downstream models trained on generated images. Subjective human trials show that images generated with latent diffusion models trained with our constraint are preferred over images from the Stable Diffusion V2 model 70% of the time. SOTA monocular depth estimation models such as DPT and PixelFormer, fine-tuned on our images, outperform the original models trained on real images by up to 7.03% in RMSE and 19.3% in SqRel on the KITTI test set for zero-shot transfer.
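The abstract reports depth-estimation gains in RMSE and SqRel without defining them. As a minimal sketch, assuming the definitions commonly used for KITTI depth evaluation (RMSE as the root of the mean squared error, SqRel as the mean squared error divided by ground-truth depth) and assuming the quoted percentages are relative reductions in error, the metrics could be computed as follows. The function names and the `gt > 0` validity mask are illustrative choices, not taken from the paper.

```python
import math


def _valid_pairs(pred, gt):
    # Assumption: pixels with non-positive ground-truth depth are invalid and skipped.
    return [(p, g) for p, g in zip(pred, gt) if g > 0]


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depths."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy depths in metres (hypothetical values, not data from the paper).
pred = [2.0, 4.0, 9.0]
gt = [1.0, 2.0, 0.0]  # the last pixel has no valid ground truth
print(rmse(pred, gt), sq_rel(pred, gt))
```

Both metrics are errors, so lower is better. Under this reading, "outperform by up to 7.03% in RMSE" would mean `relative_improvement(baseline_rmse, finetuned_rmse)` is about 7.03.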
