PixTention: Dynamic pixel-level adapter using attention maps

IF 4.2 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Dooho Choi, Yunsick Sung
{"title":"PixTention: Dynamic pixel-level adapter using attention maps","authors":"Dooho Choi,&nbsp;Yunsick Sung","doi":"10.1016/j.imavis.2025.105746","DOIUrl":null,"url":null,"abstract":"<div><div>Recent advances in image generation have popularized adapter-based fine-tuning, where Low-Rank Adaptation (LoRA) modules enable efficient personalization with minimal storage costs. However, current approaches often suffer from two key limitations: (1) manually selecting suitable LoRA adapters is time-consuming and requires expert knowledge, and (2) applying multiple adapters globally can introduce style interference and reduce image fidelity, especially for prompts with multiple distinct concepts. We propose <strong>PixTention</strong>, a framework that addresses these challenges via a novel three-stage process: <em>Curator</em>, <em>Selector</em>, and <em>Integrator</em>. The Curator uses a vision-language model to generate enriched semantic descriptions of LoRA adapters and clusters their embeddings based on shared visual themes, enabling efficient hierarchical retrieval. The Selector embeds user prompts and first selects the most relevant adapter clusters, then identifies top-K adapters within them via cosine similarity. The Integrator leverages cross-attention maps from diffusion models to assign each retrieved adapter to specific semantic regions in the output image, ensuring localized, prompt-aligned transformations without global style overwriting. Through experiments on COCO-Multi and a custom StyleCompose dataset, PixTention achieves higher CLIP scores, IoU and lower FID than baseline retrieval and reranking methods, demonstrating superior text-image alignment and image realism. Our results highlight the importance of semantic clustering, region-specific adapter composition, and cross-modal alignment in advancing controllable, high-fidelity image generation.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105746"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003348","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Recent advances in image generation have popularized adapter-based fine-tuning, where Low-Rank Adaptation (LoRA) modules enable efficient personalization with minimal storage costs. However, current approaches often suffer from two key limitations: (1) manually selecting suitable LoRA adapters is time-consuming and requires expert knowledge, and (2) applying multiple adapters globally can introduce style interference and reduce image fidelity, especially for prompts with multiple distinct concepts. We propose PixTention, a framework that addresses these challenges via a novel three-stage process: Curator, Selector, and Integrator. The Curator uses a vision-language model to generate enriched semantic descriptions of LoRA adapters and clusters their embeddings based on shared visual themes, enabling efficient hierarchical retrieval. The Selector embeds user prompts and first selects the most relevant adapter clusters, then identifies the top-K adapters within them via cosine similarity. The Integrator leverages cross-attention maps from diffusion models to assign each retrieved adapter to specific semantic regions in the output image, ensuring localized, prompt-aligned transformations without global style overwriting. Through experiments on COCO-Multi and a custom StyleCompose dataset, PixTention achieves higher CLIP scores and IoU, and lower FID, than baseline retrieval and reranking methods, demonstrating superior text-image alignment and image realism. Our results highlight the importance of semantic clustering, region-specific adapter composition, and cross-modal alignment in advancing controllable, high-fidelity image generation.
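To make the abstract's pipeline concrete, the Python sketch below illustrates the two retrieval steps it attributes to the Selector (cluster selection, then top-K ranking via cosine similarity) and a toy version of the Integrator's attention-based region masking. Every function name, embedding shape, cluster count, and threshold here is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch, assuming prompts and LoRA adapter descriptions share one
# embedding space (e.g., produced by a CLIP-style text encoder) and that the
# Curator has already assigned each adapter to a semantic cluster.
import numpy as np

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def select_adapters(prompt_emb, centroids, cluster_ids, adapter_embs,
                    n_clusters=2, top_k=3):
    """Selector step: pick the most relevant clusters, then rank the
    adapters inside those clusters and return the top-K indices."""
    best_clusters = np.argsort(-cosine_sim(prompt_emb, centroids))[:n_clusters]
    candidates = np.where(np.isin(cluster_ids, best_clusters))[0]
    scores = cosine_sim(prompt_emb, adapter_embs[candidates])
    return candidates[np.argsort(-scores)[:top_k]]

def region_masks(attn_maps: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Integrator step (toy version): binarize per-token cross-attention
    maps into spatial masks, so each retrieved adapter only modulates the
    image region its concept attends to."""
    attn = attn_maps / attn_maps.max(axis=(1, 2), keepdims=True)
    return attn > threshold

# Toy usage with random embeddings as stand-ins for real description vectors.
rng = np.random.default_rng(0)
adapter_embs = rng.normal(size=(100, 512))     # 100 adapter descriptions
cluster_ids = rng.integers(0, 10, size=100)    # 10 clusters from the Curator
centroids = np.stack([adapter_embs[cluster_ids == c].mean(axis=0)
                      for c in range(10)])
prompt_emb = rng.normal(size=512)
print(select_adapters(prompt_emb, centroids, cluster_ids, adapter_embs))
```

The two-level search is what makes retrieval "hierarchical": comparing against 10 centroids first prunes most of the 100 adapters before the finer per-adapter ranking, which is the efficiency argument the abstract makes for semantic clustering.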


Source Journal
Image and Vision Computing (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 8.50
Self-citation rate: 8.50%
Annual articles: 143
Review time: 7.8 months
Journal Description: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.