Beyond KV caching: Shared attention for efficient LLMs

IF 5.5 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Liao Bingli, Danilo Vasconcellos Vargas
{"title":"Beyond KV caching: Shared attention for efficient LLMs","authors":"Liao Bingli,&nbsp;Danilo Vasconcellos Vargas","doi":"10.1016/j.neucom.2025.130587","DOIUrl":null,"url":null,"abstract":"<div><div>The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at <span><span>https://github.com/metacarbon/shareAtt</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"648 ","pages":"Article 130587"},"PeriodicalIF":5.5000,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225012597","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at https://github.com/metacarbon/shareAtt.
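To make the mechanism concrete, the following minimal PyTorch sketch illustrates the core idea of reusing attention weights across a span of layers. The module and function names (SharedAttentionBlock, run_stack), the single-head formulation, and the particular span indices are illustrative assumptions, not the authors' implementation; the official code is available at https://github.com/metacarbon/shareAtt.

# Minimal sketch of cross-layer attention-weight sharing (hypothetical names;
# not the released shareAtt implementation).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttentionBlock(nn.Module):
    """Single-head causal self-attention that either computes its own
    attention weights or reuses weights passed in from an earlier layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_weights=None):
        v = self.v_proj(x)  # values remain layer-specific in every layer
        if shared_weights is None:
            # "Anchor" layer: compute attention weights as usual.
            q, k = self.q_proj(x), self.k_proj(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
            causal = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
            weights = F.softmax(scores + causal, dim=-1)
        else:
            # Shared layer: skip Q/K projections, QK^T, softmax, and the
            # per-layer key cache -- the source of the reported savings.
            weights = shared_weights
        return self.o_proj(weights @ v), weights


def run_stack(x, blocks, isotropic_span=(4, 7)):
    """Apply a stack of blocks; layers inside `isotropic_span` (inclusive)
    reuse the attention weights computed at the span's first layer."""
    start, end = isotropic_span
    shared = None
    for i, blk in enumerate(blocks):
        reuse = shared if start < i <= end else None
        x, w = blk(x, shared_weights=reuse)
        if i == start:
            shared = w  # cache the weights once for the whole span
    return x


if __name__ == "__main__":
    d, n_layers = 64, 8
    blocks = nn.ModuleList([SharedAttentionBlock(d) for _ in range(n_layers)])
    out = run_stack(torch.randn(2, 16, d), blocks)  # (batch, seq, d_model)
    print(out.shape)

In the layers that reuse the weights, the query/key projections, the QK^T product, the softmax, and the per-layer key cache are all skipped, which is where the FLOP and key-cache reductions described in the abstract would come from; per the paper's findings, such a span would be chosen among the later layers, where the measured isotropy is most pronounced.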
Source journal: Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.