KV-Cache Oriented Query-Aware Sparse Attention Accelerator With Cross-Stage Precision-Configurable Digital CIM
Yang Zhang; Xilong Kang; Weixuan Wang; Yizhi Ding; Lizheng Ren; Yiran Zhang; Ruiqi Tan; Zhen Wang; Hao Cai; Bo Liu
IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 72, no. 8, pp. 1073-1077, 2025. DOI: 10.1109/TCSII.2025.3580135
Citations: 0
Abstract
This brief proposes KV-CIM, a KV-Cache-oriented Digital Compute-In-Memory (DCIM) sparse attention accelerator, to address the computational and memory bottlenecks of autoregressive inference in large language models. Key innovations include: a) A query-aware pre-compute architecture, which dynamically selects and accesses the KV-Cache of critical tokens at the pre-compute stage (Stage 1) and deploys the KV-Cache segmentally on memory-constrained edge devices while maintaining computational accuracy at the formal computation stage (Stage 2); b) A cross-stage DCIM macro featuring precision-configurable adder trees, which operates in approximate mode at Stage 1 and switches to full-precision mode at Stage 2; c) A query-stationary dataflow that retains the current query tensors in q-CIM across stages to eliminate data movement. In 28-nm CMOS technology, the proposed KV-CIM achieves 35.16 TOPS/W and an 82% reduction in external memory access with negligible degradation in LLaMA2 expressiveness.
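As a rough illustration of the two-stage flow the abstract describes, the Python sketch below scores all cached keys against the current query at reduced precision to pick a small set of critical tokens (standing in for the Stage 1 approximate adder-tree mode), then runs full-precision attention over only those tokens (Stage 2). The function name, the top-k selection rule, and the 4-bit rounding are illustrative assumptions, not details taken from the brief.

```python
import numpy as np

def two_stage_sparse_attention(q, K, V, k_select=64, approx_bits=4):
    """Illustrative sketch of a query-aware two-stage sparse attention flow.

    Stage 1 (pre-compute): score all cached keys against the current query
    with coarsely quantized operands and keep only the top-k_select
    "critical" tokens. Stage 2 (formal computation): run full-precision
    attention restricted to the selected KV-Cache entries.

    k_select, approx_bits, and the quantization scheme are assumptions made
    for illustration only.
    """
    # Stage 1: approximate scoring with low-precision operands.
    scale = (2 ** (approx_bits - 1)) - 1
    q_lo = np.round(q / np.abs(q).max() * scale)
    K_lo = np.round(K / np.abs(K).max() * scale)
    approx_scores = K_lo @ q_lo                      # rough relevance per token
    top_idx = np.argsort(approx_scores)[-k_select:]  # critical-token selection

    # Stage 2: full-precision attention over the selected tokens only,
    # so only their K/V entries need to be fetched from external memory.
    K_sel, V_sel = K[top_idx], V[top_idx]
    scores = (K_sel @ q) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_sel

# Example: one decoding step with a 1024-token cache and 128-dim heads.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((1024, 128))
V = rng.standard_normal((1024, 128))
out = two_stage_sparse_attention(q, K, V)
```

Because Stage 2 touches only the selected rows of K and V, only those KV-Cache segments would need to be fetched from external memory, which is the mechanism behind the reported reduction in external memory access.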
Journal description:
TCAS II publishes brief papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems.