Parameter efficient finetuning of text-to-image models with trainable self-attention layer

Zhuoyuan Li, Yi Sun

Image and Vision Computing, Volume 151, Article 105296
Published: 2024-10-10
DOI: 10.1016/j.imavis.2024.105296
URL: https://www.sciencedirect.com/science/article/pii/S0262885624004013
Citations: 0
Abstract
We propose a novel model to efficiently fine-tune pretrained text-to-image models by introducing additional image prompts. The model integrates information from image prompts into the text-to-image (T2I) diffusion process by locking the parameters of the large T2I model and reusing a trainable copy of it, rather than relying on additional adapters. The trainable copy guides the model by injecting its trainable self-attention features into the original diffusion model, enabling the synthesis of a new, specific concept. We also apply Low-Rank Adaptation (LoRA) to restrict the trainable parameters to the self-attention layers. Furthermore, the network is optimized alongside a text embedding that serves as an object identifier to generate contextually relevant visual content. Our model is simple and effective, with a small memory footprint, yet achieves performance comparable to a fully fine-tuned T2I model in both qualitative and quantitative evaluations.
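The parameter-efficient recipe the abstract describes — lock the pretrained weights and learn only a low-rank update to the self-attention projections — can be sketched generically. The snippet below is a minimal NumPy illustration of LoRA applied to a single frozen linear projection, not the authors' code; the rank, scaling factor, and layer shape are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer W plus a trainable low-rank update (alpha/r) * B @ A.

    Only A and B would receive gradients during fine-tuning; W stays locked,
    mirroring how the paper restricts trainable parameters to the
    self-attention layers. (Hypothetical illustration, not the authors' model.)
    """

    def __init__(self, weight: np.ndarray, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                           # frozen pretrained weight
        self.A = rng.normal(0, 0.02, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))             # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # frozen base path + low-rank residual path
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


W = np.eye(6)                  # stand-in for a pretrained q/k/v projection
layer = LoRALinear(W, rank=2)
x = np.ones((1, 6))
y = layer(x)

# With B zero-initialized, the adapted layer reproduces the frozen layer exactly,
# so fine-tuning starts from the pretrained model's behavior.
print(np.allclose(y, x @ W.T))  # True
```

Note the parameter saving: for a 6x6 projection, full fine-tuning updates 36 weights, while the rank-2 adapter trains only 24 (A: 2x6 plus B: 6x2); at realistic attention dimensions the gap grows quadratically, which is what keeps the memory footprint small.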
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.