Beyond KV caching: Shared attention for efficient LLMs

IF 5.5 | Zone 2, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Liao Bingli, Danilo Vasconcellos Vargas
{"title":"Beyond KV caching: Shared attention for efficient LLMs","authors":"Liao Bingli,&nbsp;Danilo Vasconcellos Vargas","doi":"10.1016/j.neucom.2025.130587","DOIUrl":null,"url":null,"abstract":"<div><div>The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at <span><span>https://github.com/metacarbon/shareAtt</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"648 ","pages":"Article 130587"},"PeriodicalIF":5.5000,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225012597","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The rapid scaling of Large Language Models (LLMs) necessitates advancements in computational and memory efficiency during inference. While methods like Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Cross-Layer Attention (CLA) reduce Key–Value (KV) cache size by sharing K/V pairs, strategies to further reduce computational load include reusing computed attention weights across layers, an idea explored previously (e.g., LazyFormer (Ying et al., 2021)). This paper provides an extensive empirical investigation into the phenomenon of attention weight isotropy—high similarity in attention distributions across layers—within diverse modern LLMs (7B-72B scale). We demonstrate how this isotropy develops during pretraining, offering a fundamental insight into LLM attention dynamics. Leveraging these findings, we systematically evaluate and validate a cross-layer weight sharing technique, termed Shared Attention (SA). SA selectively reuses computed attention weights in layer spans identified as isotropic through our analysis. Our experiments across multiple benchmarks show that strategically applied SA maintains comparable performance to baseline models, particularly in later layers where isotropy is pronounced, while significantly reducing computational FLOPs and key cache requirements associated with attention calculation. This work provides principled guidance for optimizing attention mechanisms based on empirically observed layer dynamics in contemporary LLMs. Code and resources are available at https://github.com/metacarbon/shareAtt.
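To make the mechanism concrete, the following minimal PyTorch sketch illustrates the core idea of reusing attention weights across a span of layers. The module and function names (SharedAttentionBlock, run_stack), the single-head formulation, and the particular span indices are illustrative assumptions, not the authors' implementation; the official code is available at https://github.com/metacarbon/shareAtt.

# Minimal sketch of cross-layer attention-weight sharing (hypothetical names;
# not the released shareAtt implementation).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttentionBlock(nn.Module):
    """Single-head causal self-attention that either computes its own
    attention weights or reuses weights passed in from an earlier layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_weights=None):
        v = self.v_proj(x)  # values remain layer-specific in every layer
        if shared_weights is None:
            # "Anchor" layer: compute attention weights as usual.
            q, k = self.q_proj(x), self.k_proj(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
            causal = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
            weights = F.softmax(scores + causal, dim=-1)
        else:
            # Shared layer: skip Q/K projections, QK^T, softmax, and the
            # per-layer key cache -- the source of the reported savings.
            weights = shared_weights
        return self.o_proj(weights @ v), weights


def run_stack(x, blocks, isotropic_span=(4, 7)):
    """Apply a stack of blocks; layers inside `isotropic_span` (inclusive)
    reuse the attention weights computed at the span's first layer."""
    start, end = isotropic_span
    shared = None
    for i, blk in enumerate(blocks):
        reuse = shared if start < i <= end else None
        x, w = blk(x, shared_weights=reuse)
        if i == start:
            shared = w  # cache the weights once for the whole span
    return x


if __name__ == "__main__":
    d, n_layers = 64, 8
    blocks = nn.ModuleList([SharedAttentionBlock(d) for _ in range(n_layers)])
    out = run_stack(torch.randn(2, 16, d), blocks)  # (batch, seq, d_model)
    print(out.shape)

In the layers that reuse the weights, the query/key projections, the QK^T product, the softmax, and the per-layer key cache are all skipped, which is where the FLOP and key-cache reductions described in the abstract would come from; per the paper's findings, such a span would be chosen among the later layers, where the measured isotropy is most pronounced.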
Source journal: Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice, and applications are the essential topics covered.