Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration

IF 3.7 | CAS Region 2 (Engineering & Technology) | JCR Q2 (Engineering, Electrical & Electronic)
Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He
{"title":"Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration","authors":"Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He","doi":"10.1109/JETCAS.2025.3562937","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. From a hardware perspective, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) to reduce the hardware overhead associated with high-bit-width outliers, effectively leveraging their inherent sparsity. To support mixed-precision computation and accommodate diverse dataflows during the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead associated with additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to <inline-formula> <tex-math>$2.48\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.01\\times $ </tex-math></inline-formula> energy reduction in LLM prefilling phase, and <inline-formula> <tex-math>$1.87\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.03\\times $ </tex-math></inline-formula> energy reduction in the decoding phase, with superior model performance under post-training quantization.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"272-284"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971983/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) scheme with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. On the hardware side, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) that reduces the hardware overhead of high-bit-width outliers by exploiting their inherent sparsity. To support mixed-precision computation and accommodate the distinct dataflows of the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead of additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to $2.48\times$ speedup and $2.01\times$ energy reduction in the LLM prefilling phase, and $1.87\times$ speedup and $2.03\times$ energy reduction in the decoding phase, while delivering superior model performance under post-training quantization.
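The abstract describes ATRQ only at a high level. The sketch below is a minimal NumPy model of one plausible reading: each group gets two symmetric quantization ranges, a narrow one for normal values and a wide one for outliers, plus a per-group identifier telling the decoder which range each entry uses. The threshold rule, bit widths, and separate-bitmask identifier here are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def atrq_quantize_group(group, normal_bits=4, outlier_bits=8, outlier_frac=0.03):
    """Two-range quantization sketch for one group: a narrow range for
    normal values and a wider range for outliers. Threshold choice, bit
    widths, and identifier layout are illustrative, not the paper's ATRQ."""
    # Pick a magnitude threshold so roughly `outlier_frac` of entries are outliers.
    thresh = float(np.quantile(np.abs(group), 1.0 - outlier_frac))
    is_outlier = np.abs(group) > thresh

    # Separate scales for the two symmetric uniform ranges.
    normal_scale = max(thresh, 1e-8) / (2 ** (normal_bits - 1) - 1)
    outlier_scale = max(float(np.abs(group).max()), 1e-8) / (2 ** (outlier_bits - 1) - 1)

    # Quantize each entry against the scale of the range it falls in.
    q = np.where(is_outlier,
                 np.round(group / outlier_scale),
                 np.round(group / normal_scale)).astype(np.int32)

    # A per-group identifier marks which entries use the outlier range,
    # so the decoder can pick the right scale for each entry.
    identifier = np.packbits(is_outlier.astype(np.uint8))
    return q, identifier, normal_scale, outlier_scale

def atrq_dequantize_group(q, identifier, normal_scale, outlier_scale):
    """Reverse the two-range mapping using the per-group identifier."""
    is_outlier = np.unpackbits(identifier)[:q.size].astype(bool)
    return np.where(is_outlier, q * outlier_scale, q * normal_scale)

# Usage: quantize a group of 32 values containing one outlier, then reconstruct.
rng = np.random.default_rng(0)
g = rng.standard_normal(32).astype(np.float32)
g[3] = 9.0  # inject an outlier
q, ident, s_n, s_o = atrq_quantize_group(g)
g_hat = atrq_dequantize_group(q, ident, s_n, s_o)
print("max reconstruction error:", np.max(np.abs(g - g_hat)))
```

In the paper the identifier is embedded in-group rather than stored as a side bitmask, which is what keeps groups fixed-size and memory accesses aligned; the sketch separates it only for readability.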
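The outlier-bit-split PE can likewise be sketched at behavior level: a high-bit-width outlier is split into low-bit slices so the narrow multiplier used for normal values can be reused, with the high slice's partial product shifted before accumulation. The slice widths and function below are illustrative assumptions, not the paper's datapath.

```python
def bit_split_mac(outlier_q, activation, pe_bits=4):
    """Behavioral model of an outlier-bit-split multiply: an 8-bit outlier
    weight is split into two 4-bit slices, each multiplied on a narrow
    unit, with the high slice's partial product shifted before the add."""
    sign = -1 if outlier_q < 0 else 1
    mag = abs(outlier_q)
    lo = mag & ((1 << pe_bits) - 1)   # low slice
    hi = mag >> pe_bits               # high slice
    # Two narrow multiplies replace one wide multiply:
    # mag * act == ((hi * act) << pe_bits) + lo * act
    return sign * (((hi * activation) << pe_bits) + (lo * activation))

assert bit_split_mac(-77, 13) == -77 * 13
```

Because outliers are sparse, the extra high-slice pass is needed only rarely, which is the sparsity the abstract says the PE exploits.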
Source Journal

CiteScore: 8.50
Self-citation rate: 2.20%
Publication volume: 86
Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.