Text optimization with latent inversion for non-rigid image editing

IF 3.3, Zone 3 (Computer Science), Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Pattern Recognition Letters Pub Date : 2025-06-28 DOI:10.1016/j.patrec.2025.06.011

Yunji Jung, Seokju Lee, Tair Djanibekov, Jong Chul Ye, Hyunjung Shim

Pattern Recognition Letters, volume 196, pages 281-288. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0167865525002399

Citations: 0

Abstract

Text-guided non-rigid image editing involves complex edits to an input image, such as changing the motion or composition of an object (e.g., making a horse jump or adding candles to a cake). Because such edits require manipulating the structure of the object, existing methods often compromise "image identity" (defined as the overall object appearance and background details), particularly when combined with Stable Diffusion. In this work, we propose a new approach for non-rigid image editing with Stable Diffusion that aims to improve identity preservation without compromising editability. Our approach comprises three stages: text optimization, latent inversion, and timestep-aware text injection sampling. Inspired by the success of Imagic, we adopt its text optimization for smooth editing. We then introduce latent inversion to preserve the input image's identity without additional model fine-tuning. To fully exploit the input reconstruction ability of latent inversion, we employ timestep-aware text injection sampling, which injects the source text prompt in early sampling steps and then transitions to the target prompt in subsequent sampling steps. This strategy works together with text optimization, enabling complex non-rigid edits to the input without losing the original identity. Extensive experiments demonstrate the effectiveness of our method in terms of identity preservation, editability, and aesthetic quality. Our code is available at https://github.com/YunjiJung0105/TOLI-non-rigid-editing.
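
The timestep-aware text injection described in the abstract can be illustrated with a minimal sketch. The abstract does not say how or when the switch from source to target prompt happens, so the hard switch and the `switch_ratio` parameter below are assumptions, and the function names are hypothetical rather than taken from the authors' repository. The strings "source" and "target" stand in for the actual text embeddings.

```python
from typing import List, TypeVar

T = TypeVar("T")


def select_conditioning(step: int, num_steps: int, source: T, target: T,
                        switch_ratio: float = 0.3) -> T:
    """Pick the text conditioning for one sampling step.

    step 0 is the first (noisiest) sampling step. The hard switch and the
    default switch_ratio are illustrative assumptions, not values from the paper.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if not 0.0 <= switch_ratio <= 1.0:
        raise ValueError("switch_ratio must lie in [0, 1]")
    # Number of early steps that are conditioned on the source prompt.
    switch_step = int(round(switch_ratio * num_steps))
    return source if step < switch_step else target


def build_schedule(num_steps: int, switch_ratio: float = 0.3) -> List[str]:
    """Return the per-step conditioning labels for a full sampling run."""
    return [
        select_conditioning(i, num_steps, "source", "target", switch_ratio)
        for i in range(num_steps)
    ]


schedule = build_schedule(num_steps=10, switch_ratio=0.3)
print(schedule)
```

In a real sampler, the value returned by `select_conditioning` would be passed to the denoising network at each step in place of a single fixed prompt embedding. Early steps conditioned on the source prompt favor reconstructing the input, while later steps conditioned on the target prompt apply the edit.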


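Source journal

Pattern Recognition Letters (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 12.40

Self-citation rate: 5.90%

Articles published: 287

Review time: 9.1 months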

Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.