{"title":"Agent Attention: On the Integration of Softmax and Linear Attention","authors":"Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang","doi":"arxiv-2312.08874","DOIUrl":null,"url":null,"abstract":"The attention module is the key component in Transformers. While the global\nattention mechanism offers high expressiveness, its excessive computational\ncost restricts its applicability in various scenarios. In this paper, we\npropose a novel attention paradigm, Agent Attention, to strike a favorable\nbalance between computational efficiency and representation power.\nSpecifically, the Agent Attention, denoted as a quadruple $(Q, A, K, V)$,\nintroduces an additional set of agent tokens $A$ into the conventional\nattention module. The agent tokens first act as the agent for the query tokens\n$Q$ to aggregate information from $K$ and $V$, and then broadcast the\ninformation back to $Q$. Given the number of agent tokens can be designed to be\nmuch smaller than the number of query tokens, the agent attention is\nsignificantly more efficient than the widely adopted Softmax attention, while\npreserving global context modelling capability. Interestingly, we show that the\nproposed agent attention is equivalent to a generalized form of linear\nattention. Therefore, agent attention seamlessly integrates the powerful\nSoftmax attention and the highly efficient linear attention. Extensive\nexperiments demonstrate the effectiveness of agent attention with various\nvision Transformers and across diverse vision tasks, including image\nclassification, object detection, semantic segmentation and image generation.\nNotably, agent attention has shown remarkable performance in high-resolution\nscenarios, owning to its linear attention nature. For instance, when applied to\nStable Diffusion, our agent attention accelerates generation and substantially\nenhances image generation quality without any additional training. Code is\navailable at https://github.com/LeapLabTHU/Agent-Attention.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"21 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.08874","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The attention module is the key component in Transformers. While the global
attention mechanism offers high expressiveness, its excessive computational
cost restricts its applicability in various scenarios. In this paper, we
propose a novel attention paradigm, Agent Attention, to strike a favorable
balance between computational efficiency and representation power.
Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$,
introduces an additional set of agent tokens $A$ into the conventional
attention module. The agent tokens first act as agents for the query tokens
$Q$, aggregating information from $K$ and $V$, and then broadcast that
information back to $Q$. Since the number of agent tokens can be made
much smaller than the number of query tokens, agent attention is
significantly more efficient than the widely adopted Softmax attention while
preserving global context modelling capability. Interestingly, we show that the
proposed agent attention is equivalent to a generalized form of linear
attention. Therefore, agent attention seamlessly integrates the powerful
Softmax attention and the highly efficient linear attention. Extensive
experiments demonstrate the effectiveness of agent attention with various
vision Transformers and across diverse vision tasks, including image
classification, object detection, semantic segmentation and image generation.
Notably, agent attention shows remarkable performance in high-resolution
scenarios, owing to its linear attention nature. For instance, when applied to
Stable Diffusion, our agent attention accelerates generation and substantially
enhances image generation quality without any additional training. Code is
available at https://github.com/LeapLabTHU/Agent-Attention.
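
To make the mechanism concrete, below is a minimal sketch of the two-step computation described in the abstract: agent aggregation, $\sigma(A K^\top/\sqrt{d})\,V$, followed by agent broadcast, $\sigma(Q A^\top/\sqrt{d})$ applied to the aggregated values, where $\sigma$ denotes row-wise Softmax. Composing the two steps gives $\sigma(Q A^\top/\sqrt{d})\,\sigma(A K^\top/\sqrt{d})\,V$, whose two-factor structure is what the abstract refers to as a generalized form of linear attention. This is an illustrative sketch in plain PyTorch, not the authors' released implementation; the function name agent_attention and the pooling used to form the agent tokens here are assumptions for the example.

```python
# Sketch of the two-step agent attention described in the abstract:
# agent aggregation followed by agent broadcast. Illustrative only.
import torch
import torch.nn.functional as F


def agent_attention(q, k, v, a):
    """q, k, v: (batch, n, d) query/key/value tokens; a: (batch, m, d) agent
    tokens with m << n. Returns an output of shape (batch, n, d)."""
    d = q.shape[-1]
    scale = d ** -0.5
    # Agent aggregation: agents act as queries over K and V (Softmax attention).
    agent_v = F.softmax(a @ k.transpose(-2, -1) * scale, dim=-1) @ v   # (b, m, d)
    # Agent broadcast: queries attend to the agents to read the aggregated info.
    out = F.softmax(q @ a.transpose(-2, -1) * scale, dim=-1) @ agent_v  # (b, n, d)
    return out


if __name__ == "__main__":
    b, n, m, d = 2, 4096, 49, 64
    q = torch.randn(b, n, d)
    k = torch.randn(b, n, d)
    v = torch.randn(b, n, d)
    # One simple way to obtain agent tokens is to pool the queries; this is an
    # assumption for the example, not necessarily how the paper defines A.
    a = F.adaptive_avg_pool1d(q.transpose(1, 2), m).transpose(1, 2)     # (b, m, d)
    print(agent_attention(q, k, v, a).shape)  # torch.Size([2, 4096, 64])
```

Because both Softmax operations involve an $n \times m$ score matrix (with $m$ agent tokens and $n$ query tokens), the cost scales linearly in $n$ for fixed $m$, in contrast to the $n \times n$ score matrix of standard Softmax attention.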