Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

IF 7.8 · JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING · CAS Tier 1 (Computer Science)
Hadi Alzayer, Zhihao Xia, Xuaner (Cecilia) Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi
DOI: 10.1145/3750722
Journal: ACM Transactions on Graphics · Published 2025-07-24 · Journal Article
Project page and code: https://magic-fixup.github.io
Citations: 0

Abstract

We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, yet adapts them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: object and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects. Project page and code can be found at https://magic-fixup.github.io
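The training-pair construction described in the abstract — sample two frames from the same video at a random interval, then warp the source toward the target to mimic a coarse user edit — can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, and the constant-shift "motion model" is a stand-in for the flow-based motion models the paper uses.

```python
import numpy as np

def warp_with_flow(src, flow):
    """Backward-warp src (H, W, C) by a dense flow field (H, W, 2).

    Nearest-neighbor sampling: output pixel (y, x) reads
    src[y + flow[y, x, 1], x + flow[y, x, 0]], clamped to image bounds.
    """
    h, w = src.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return src[sy, sx]

def make_training_pair(video, rng, max_gap=30):
    """Sample (source, target) frames at a random time interval, then
    warp the source toward the target to mimic a coarse user edit.

    Returns (warped_source, target); a model would be supervised to
    translate warped_source into target.
    """
    n = len(video)
    t0 = int(rng.integers(0, n - 1))
    t1 = min(n - 1, t0 + int(rng.integers(1, max_gap + 1)))
    source, target = video[t0], video[t1]
    # Stand-in for a real motion model (e.g. optical flow between the
    # two frames): a constant 2-pixel shift in x and y.
    flow = np.full(source.shape[:2] + (2,), 2.0)
    warped = warp_with_flow(source, flow)
    return warped, target
```

At training time each (warped, target) pair would supervise the generator: the warped frame plays the role of the coarse user edit, and the real target frame provides ground truth for the second-order effects (lighting, contact shadows) the model must synthesize.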
Source Journal
ACM Transactions on Graphics (Engineering/Technology – Computer Science: Software Engineering)
CiteScore: 14.30
Self-citation rate: 25.80%
Articles per year: 193
Review time: 12 months
Journal description: ACM Transactions on Graphics (TOG) is a peer-reviewed scientific journal that aims to disseminate the latest findings of note in the field of computer graphics. It has been published since 1982 by the Association for Computing Machinery. Starting in 2003, all papers accepted for presentation at the annual SIGGRAPH conference are printed in a special summer issue of the journal.