{"title":"Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Parameter Adapter.","authors":"Peng Xing,Ning Wang,Jianbo Ouyang,Zechao Li","doi":"10.1109/tpami.2025.3590321","DOIUrl":null,"url":null,"abstract":"The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-costs requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., CLIP vision encoder), which produces weak alignment signals with the text-to-image model that may lose face information and is not well 'absorbed' by the text-to-image model. Towards this end, we propose Inv-Adapter, which first introduces a more reasonable and efficient token representation of ID image features and introduces a lightweight parameter adaptor to inject ID features. Specifically, our Inv-Adapter extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without an additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter to embed them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, generation loyalty, speed, training costs, model scale and generalization ability in scenarios of general object, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"24 1","pages":""},"PeriodicalIF":20.8000,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3590321","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
The remarkable advancement of text-to-image generation models has significantly boosted research on ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-cost requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., a CLIP vision encoder), whose features are weakly aligned with the text-to-image model: they may lose face information and are not well 'absorbed' by the text-to-image model. To this end, we propose Inv-Adapter, which first introduces a more suitable and efficient token representation of ID image features and then a lightweight parameter adapter to inject them. Specifically, Inv-Adapter extracts diffusion-domain representations of ID images from a pre-trained text-to-image model via DDIM image inversion, without any additional image encoder. Benefiting from the high alignment between the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter that embeds them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, faithfulness to the text prompt, speed, training cost, model scale, and generalization to general-object scenarios, all of which show that the proposed Inv-Adapter is highly competitive in both ID customization generation and model scale.
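The two mechanisms the abstract describes, extracting diffusion-domain ID features through DDIM inversion and injecting them through a small attention adapter, can be illustrated with a short PyTorch sketch. This is a toy illustration under stated assumptions, not the paper's implementation: ToyEps stands in for the pre-trained text-to-image UNet, the trailing inversion latents stand in for the ID prompt tokens, and IDCrossAttentionAdapter is a generic decoupled cross-attention adapter whose new key/value projections would be the only trained weights.

```python
# Minimal sketch (assumptions noted above): DDIM inversion to obtain
# diffusion-domain ID tokens, plus a lightweight cross-attention adapter
# that adds their contribution to the frozen text cross-attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEps(nn.Module):
    """Toy noise predictor standing in for a pre-trained text-to-image UNet."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = t.float().view(-1, 1) / 1000.0            # crude timestep embedding
        return self.net(torch.cat([x, t_feat.expand(x.shape[0], 1)], dim=-1))


@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model: nn.Module, alphas_bar: torch.Tensor,
                num_steps: int = 50):
    """Deterministic DDIM inversion: run the DDIM update from the clean latent
    toward noise, collecting intermediate latents as diffusion-domain ID features."""
    steps = torch.linspace(0, len(alphas_bar) - 1, num_steps + 1).long()
    x, trajectory = x0, [x0]
    for i in range(num_steps):
        t, t_next = steps[i], steps[i + 1]
        a_t, a_next = alphas_bar[t], alphas_bar[t_next]
        eps = eps_model(x, t.expand(x.shape[0]))
        # predict the clean sample, then re-noise it to the next (higher) noise level
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(x)
    return x, trajectory


class IDCrossAttentionAdapter(nn.Module):
    """Lightweight adapter: a second cross-attention over ID tokens whose output
    is added to the frozen text cross-attention output (decoupled-attention style)."""
    def __init__(self, dim: int = 64, id_dim: int = 64, scale: float = 1.0):
        super().__init__()
        self.to_k_id = nn.Linear(id_dim, dim, bias=False)   # the only new weights
        self.to_v_id = nn.Linear(id_dim, dim, bias=False)
        self.scale = scale

    def forward(self, hidden: torch.Tensor, text_attn_out: torch.Tensor,
                id_tokens: torch.Tensor) -> torch.Tensor:
        q = hidden                                          # reuse the frozen query states
        k, v = self.to_k_id(id_tokens), self.to_v_id(id_tokens)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return text_attn_out + self.scale * (attn @ v)


if __name__ == "__main__":
    torch.manual_seed(0)
    dim, T = 64, 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    id_latent = torch.randn(1, dim)                         # stands in for the encoded ID image
    _, traj = ddim_invert(id_latent, ToyEps(dim), alphas_bar, num_steps=25)
    id_tokens = torch.stack(traj[-4:], dim=1)               # (1, 4, dim) ID prompt tokens

    adapter = IDCrossAttentionAdapter(dim=dim, id_dim=dim)
    hidden = torch.randn(1, 16, dim)                        # toy UNet hidden states
    text_attn_out = torch.randn(1, 16, dim)                 # frozen text cross-attention output
    print(adapter(hidden, text_attn_out, id_tokens).shape)  # torch.Size([1, 16, 64])
```

Reusing the frozen query states and training only the two new key/value projections (plus a scale) is what keeps such an adapter lightweight, and the diffusion-domain tokens obtained by inversion replace the CLIP image encoder entirely in this sketch.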
Journal Introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.