{"title":"Beyond KV caching: Shared attention for efficient LLMs","authors":"Liao Bingli, Danilo Vasconcellos Vargas","doi":"10.1016/j.neucom.2025.130587","DOIUrl":null,"url":null,"abstract":"<div><div>The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at <span><span>https://github.com/metacarbon/shareAtt</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"648 ","pages":"Article 130587"},"PeriodicalIF":5.5000,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225012597","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at https://github.com/metacarbon/shareAtt.
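The core mechanism described above, reusing one layer's computed attention weights in later, isotropic layers, can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed names and shapes (SharedAttentionBlock, a causal decoder-style stack, an anchor/follower layer split); it is not the authors' released implementation, which is available at the linked repository.

```python
# Minimal sketch of cross-layer attention-weight sharing ("Shared Attention").
# All module and argument names here are illustrative assumptions, not taken
# from the paper's code release.
import math
import torch
import torch.nn as nn


class SharedAttentionBlock(nn.Module):
    """One attention layer that either computes its own attention weights
    or reuses weights produced by an earlier (isotropic) anchor layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_weights=None):
        B, T, _ = x.shape
        # Value projection is always computed per layer.
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        if shared_weights is None:
            # Anchor layer: compute causal attention weights normally.
            q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            scores = scores.masked_fill(causal, float("-inf"))
            weights = scores.softmax(dim=-1)
        else:
            # Follower layer in an isotropic span: skip the Q/K projections and
            # softmax entirely, reusing the anchor layer's attention weights
            # (saving FLOPs and removing this layer's need for a key cache).
            weights = shared_weights

        out = (weights @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), weights
```

In this sketch, the first layer of a span identified as isotropic computes and returns its attention weights, and each subsequent layer in that span receives them as shared_weights, skipping its own query/key projections, softmax, and key caching.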
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.