{"title":"End-to-End Acceleration of Generative Models With Runtime Regularized KV Cache Management","authors":"Ashkan Moradifirouzabadi;Mingu Kang","doi":"10.1109/JETCAS.2025.3568716","DOIUrl":null,"url":null,"abstract":"Despite their remarkable success in achieving high performance, Transformer-based models impose substantial computational and memory bandwidth requirements, posing significant challenges for hardware deployment. A key contributor to these challenges is the large KV cache, which increases data movement costs in addition to the model parameters. While various token pruning techniques have been proposed to reduce the computational complexity and storage requirements of the attention mechanism by eliminating redundant tokens, these methods often introduce irregularities in the sparsity patterns that complicate hardware implementation. To address these challenges, we propose a hardware and algorithm co-design approach. Our solution features a Runtime Cache Eviction (RCE) algorithm that removes the least relevant tokens and replaces them with newly generated ones, maintaining a constant KV cache size across blocks and inputs. To support this algorithm, we design an accelerator equipped with a KV Memory Management Unit (KV-MMU), which efficiently manages active tokens through eviction and replacement, thereby optimizing DRAM storage and access. Additionally, our design integrates batch processing and an optimized processing pipeline to improve end-to-end throughput, effectively meeting the requirements of both pre-filling and generation stages. The proposed system achieves up to <inline-formula> <tex-math>$8\\times $ </tex-math></inline-formula> KV cache size reduction with minimal accuracy degradation. In a 65 nm process, the proposed accelerator demonstrates <inline-formula> <tex-math>$1.52\\times $ </tex-math></inline-formula> energy savings and <inline-formula> <tex-math>$3.62\\times $ </tex-math></inline-formula> delay reductions when processing a batch size of 16, with only a 1.11% energy overhead attributed to the specialized KV-MMU.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"217-230"},"PeriodicalIF":3.8000,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10994487/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Despite their remarkable success in achieving high performance, Transformer-based models impose substantial computational and memory bandwidth requirements, posing significant challenges for hardware deployment. A key contributor to these challenges is the large KV cache, which adds data movement costs on top of those for the model parameters. While various token pruning techniques reduce the computational complexity and storage requirements of the attention mechanism by eliminating redundant tokens, these methods often introduce irregular sparsity patterns that complicate hardware implementation. To address these challenges, we propose a hardware-algorithm co-design approach. Our solution features a Runtime Cache Eviction (RCE) algorithm that removes the least relevant tokens and replaces them with newly generated ones, maintaining a constant KV cache size across blocks and inputs. To support this algorithm, we design an accelerator equipped with a KV Memory Management Unit (KV-MMU), which efficiently manages active tokens through eviction and replacement, thereby optimizing DRAM storage and access. Additionally, our design integrates batch processing and an optimized processing pipeline to improve end-to-end throughput, effectively meeting the requirements of both the pre-filling and generation stages. The proposed system achieves up to an $8\times$ reduction in KV cache size with minimal accuracy degradation. In a 65 nm process, the proposed accelerator delivers $1.52\times$ energy savings and a $3.62\times$ delay reduction at a batch size of 16, with only a 1.11% energy overhead attributed to the specialized KV-MMU.
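The abstract describes RCE only at a high level. The Python sketch below illustrates the general idea of constant-size KV cache eviction during decoding: each generated token reuses the slot of the least relevant cached token, so the cache footprint never grows. This is a minimal toy under stated assumptions, not the paper's implementation; the relevance metric (accumulated attention probabilities), the function names, and the slot-reuse policy for new tokens are all assumptions made for illustration.

import numpy as np

# Hypothetical sketch of runtime cache eviction (RCE): the KV cache is a
# fixed-size buffer; each decode step evicts the least relevant cached
# token and reuses its slot for the newly generated token.

CACHE_SIZE, HEAD_DIM = 8, 64

def attend(q, k_cache, v_cache):
    """Standard scaled dot-product attention over the cached tokens."""
    scores = k_cache @ q / np.sqrt(HEAD_DIM)
    probs = np.exp(scores - scores.max())    # numerically stable softmax
    probs /= probs.sum()
    return probs @ v_cache, probs

def rce_update(k_cache, v_cache, relevance, probs, k_new, v_new):
    """Accumulate attention mass as the relevance metric (an assumption),
    then evict the lowest-scoring slot and write the new token there."""
    relevance += probs                        # frequently attended tokens persist
    slot = int(np.argmin(relevance))          # least relevant token
    k_cache[slot], v_cache[slot] = k_new, v_new
    relevance[slot] = relevance.mean()        # fresh token starts mid-range
    return slot

# Toy decode loop over random tensors.
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((CACHE_SIZE, HEAD_DIM))
v_cache = rng.standard_normal((CACHE_SIZE, HEAD_DIM))
relevance = np.zeros(CACHE_SIZE)
for _ in range(4):
    q = rng.standard_normal(HEAD_DIM)
    context, probs = attend(q, k_cache, v_cache)
    rce_update(k_cache, v_cache, relevance, probs,
               rng.standard_normal(HEAD_DIM), rng.standard_normal(HEAD_DIM))

Keeping the buffer size fixed is what enables the hardware side of the co-design: a KV-MMU can map logical token indices to a bounded set of physical DRAM slots, avoiding the irregular sparsity patterns that pure token pruning introduces.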
Journal Introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits special issues, with particular emphasis on emerging areas, on topics covering the entire scope of the IEEE Circuits and Systems (CAS) Society: the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.