Anchor Attention, Small Cache: Code Generation With Large Language Models

IF 6.5 | Region 1, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering Pub Date : 2025-03-15 DOI:10.1109/TSE.2025.3570680

Xiangyu Zhang;Yu Zhou;Guang Yang;Harald C. Gall;Taolue Chen

IEEE Transactions on Software Engineering, vol. 51, no. 6, pp. 1866-1881. Journal Article. https://ieeexplore.ieee.org/document/11005718/

Citations: 0

Abstract

The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computational resources has hindered broader deployment and raised environmental concerns. A common strategy for reducing computational demands is to cache the Key-Value (KV) states of the attention mechanism, which is used by most mainstream LLMs. Caching reduces the need for repeated attention computations, but it brings significant memory overhead. Current practices in NLP often use sparse attention, which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the distribution of attention weights within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information, and layer-wise anchor attention that enables cross-layer communication to mitigate the excessive superposition caused by the compression. Extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving most of the model's performance.
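To give a sense of scale for the memory overhead and the "at least 70%" figure mentioned in the abstract, here is a minimal back-of-the-envelope sketch. It uses the standard size estimate for a dense KV cache (keys plus values, per layer, per head, per token). The model configuration below is a hypothetical 7B-style setup chosen for illustration, not one taken from the paper, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor 2 accounts for one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration (assumption, not from the paper):
# 32 layers, 32 KV heads of dimension 128, 4096 tokens, fp16 values.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

# Apply the abstract's lower bound of a 70% reduction in KV cache size.
reduced = full * (1 - 0.70)

print(f"full cache:          {full / 2**30:.2f} GiB")
print(f"after 70% reduction: {reduced / 2**30:.2f} GiB")
```

Under these assumed dimensions the dense cache for a single 4096-token sequence is on the order of gigabytes, which is why a reduction of this size matters for deployment. The estimate says nothing about how AnchorCoder achieves the reduction; it only illustrates the arithmetic.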


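Source journal

IEEE Transactions on Software Engineering (Engineering Technology, Engineering: Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Articles published: 724

Review time: 6 months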

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.