Parameter efficient finetuning of text-to-image models with trainable self-attention layer

IF 4.2 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2024-10-10 DOI:10.1016/j.imavis.2024.105296

Zhuoyuan Li, Yi Sun

Citations: 0

Abstract

We propose a novel model to efficiently finetune pretrained Text-to-Image models by introducing additional image prompts. The model integrates information from image prompts into the text-to-image (T2I) diffusion process by locking the parameters of the large T2I model and reusing its trainable copy, rather than relying on additional adapters. The trainable copy guides the model by injecting its trainable self-attention features into the original diffusion model, enabling the synthesis of a new specific concept. We also apply Low-Rank Adaptation (LoRA) to restrict the trainable parameters in the self-attention layers. Furthermore, the network is optimized alongside a text embedding that serves as an object identifier to generate contextually relevant visual content. Our model is simple and effective, with a small memory footprint, yet can achieve comparable performance to a fully fine-tuned T2I model in both qualitative and quantitative evaluations.
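The abstract mentions restricting the trainable parameters of the self-attention layers with Low-Rank Adaptation (LoRA). As a rough illustration of that general idea only, and not the authors' implementation, the sketch below shows a single linear projection with a frozen weight `W` and a trainable low-rank update `(alpha / rank) * B @ A`, following the standard LoRA formulation. The class name `LoRALinear`, the helper `matvec`, the layer width of 320, the rank of 4, and the random stand-in weights are all assumptions chosen for the example; the paper's trainable copy and feature injection are not modeled here.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical linear projection: frozen W plus a low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight, shape d_out x d_in (random stand-in here).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is rank x d_in, B is d_out x rank.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the layer initially matches the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank trainable path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


# Assumed sizes for illustration: a 320-wide projection with rank 4.
layer = LoRALinear(d_in=320, d_out=320, rank=4)
x = [1.0] * 320
print(layer.trainable_params(), layer.frozen_params())
```

With these assumed sizes, only the two small factors would be updated during training: rank * (d_in + d_out) values instead of the d_in * d_out values of the full weight matrix, which is where the memory saving of LoRA-style finetuning comes from.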




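Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months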

Journal description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.