{"title":"Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Parameter Adapter.","authors":"Peng Xing,Ning Wang,Jianbo Ouyang,Zechao Li","doi":"10.1109/tpami.2025.3590321","DOIUrl":null,"url":null,"abstract":"The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-costs requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., CLIP vision encoder), which produces weak alignment signals with the text-to-image model that may lose face information and is not well 'absorbed' by the text-to-image model. Towards this end, we propose Inv-Adapter, which first introduces a more reasonable and efficient token representation of ID image features and introduces a lightweight parameter adaptor to inject ID features. Specifically, our Inv-Adapter extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without an additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter to embed them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, generation loyalty, speed, training costs, model scale and generalization ability in scenarios of general object, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"24 1","pages":""},"PeriodicalIF":20.8000,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3590321","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
The remarkable advancement of text-to-image generation models has significantly boosted research on ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-cost requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., a CLIP vision encoder), whose features are weakly aligned with the text-to-image model: they may lose face information and are not well 'absorbed' by the text-to-image model. To this end, we propose Inv-Adapter, which first introduces a more suitable and efficient token representation of ID image features and then a lightweight parameter adapter to inject them. Specifically, Inv-Adapter extracts diffusion-domain representations of ID images from a pre-trained text-to-image model via DDIM image inversion, without any additional image encoder. Benefiting from the high alignment between the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter that embeds them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, faithfulness to the text prompt, speed, training cost, model scale, and generalization to general-object scenarios, all of which show that the proposed Inv-Adapter is highly competitive in both ID customization generation and model scale.
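The two mechanisms the abstract describes, extracting diffusion-domain ID features through DDIM inversion and injecting them through a small attention adapter, can be illustrated with a short PyTorch sketch. This is a toy illustration under stated assumptions, not the paper's implementation: ToyEps stands in for the pre-trained text-to-image UNet, the trailing inversion latents stand in for the ID prompt tokens, and IDCrossAttentionAdapter is a generic decoupled cross-attention adapter whose new key/value projections would be the only trained weights.

```python
# Minimal sketch (assumptions noted above): DDIM inversion to obtain
# diffusion-domain ID tokens, plus a lightweight cross-attention adapter
# that adds their contribution to the frozen text cross-attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEps(nn.Module):
    """Toy noise predictor standing in for a pre-trained text-to-image UNet."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = t.float().view(-1, 1) / 1000.0            # crude timestep embedding
        return self.net(torch.cat([x, t_feat.expand(x.shape[0], 1)], dim=-1))


@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model: nn.Module, alphas_bar: torch.Tensor,
                num_steps: int = 50):
    """Deterministic DDIM inversion: run the DDIM update from the clean latent
    toward noise, collecting intermediate latents as diffusion-domain ID features."""
    steps = torch.linspace(0, len(alphas_bar) - 1, num_steps + 1).long()
    x, trajectory = x0, [x0]
    for i in range(num_steps):
        t, t_next = steps[i], steps[i + 1]
        a_t, a_next = alphas_bar[t], alphas_bar[t_next]
        eps = eps_model(x, t.expand(x.shape[0]))
        # predict the clean sample, then re-noise it to the next (higher) noise level
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(x)
    return x, trajectory


class IDCrossAttentionAdapter(nn.Module):
    """Lightweight adapter: a second cross-attention over ID tokens whose output
    is added to the frozen text cross-attention output (decoupled-attention style)."""
    def __init__(self, dim: int = 64, id_dim: int = 64, scale: float = 1.0):
        super().__init__()
        self.to_k_id = nn.Linear(id_dim, dim, bias=False)   # the only new weights
        self.to_v_id = nn.Linear(id_dim, dim, bias=False)
        self.scale = scale

    def forward(self, hidden: torch.Tensor, text_attn_out: torch.Tensor,
                id_tokens: torch.Tensor) -> torch.Tensor:
        q = hidden                                          # reuse the frozen query states
        k, v = self.to_k_id(id_tokens), self.to_v_id(id_tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return text_attn_out + self.scale * (attn @ v)


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, T = 64, 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    id_latent = torch.randn(1, dim)                         # stands in for the encoded ID image
    _, traj = ddim_invert(id_latent, ToyEps(dim), alphas_bar, num_steps=25)
    id_tokens = torch.stack(traj[-4:], dim=1)               # (1, 4, dim) ID prompt tokens

    adapter = IDCrossAttentionAdapter(dim=dim, id_dim=dim)
    hidden = torch.randn(1, 16, dim)                        # toy UNet hidden states
    text_attn_out = torch.randn(1, 16, dim)                 # frozen text cross-attention output
    print(adapter(hidden, text_attn_out, id_tokens).shape)  # torch.Size([1, 16, 64])
```

Reusing the frozen query states and training only the two new key/value projections (plus a scale) is what keeps such an adapter lightweight, and the diffusion-domain tokens obtained by inversion replace the CLIP image encoder entirely in this sketch.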
Journal Introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.