Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration

IF 3.7 | CAS Region 2 (Engineering & Technology) | JCR Q2 (Engineering, Electrical & Electronic)
Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He
{"title":"Adaptive Two-Range Quantization and Hardware Co-Design for Large Language Model Acceleration","authors":"Siqi Cai;Gang Wang;Wenjie Li;Dongxu Lyu;Guanghui He","doi":"10.1109/JETCAS.2025.3562937","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. From a hardware perspective, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) to reduce the hardware overhead associated with high-bit-width outliers, effectively leveraging their inherent sparsity. To support mixed-precision computation and accommodate diverse dataflows during the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead associated with additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to <inline-formula> <tex-math>$2.48\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.01\\times $ </tex-math></inline-formula> energy reduction in LLM prefilling phase, and <inline-formula> <tex-math>$1.87\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$2.03\\times $ </tex-math></inline-formula> energy reduction in the decoding phase, with superior model performance under post-training quantization.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"272-284"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971983/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Large language models (LLMs) face high computational and memory demands. While prior studies have leveraged quantization to reduce memory requirements, critical challenges persist: unaligned memory accesses, significant quantization errors when handling outliers that span larger quantization ranges, and the increased hardware overhead associated with processing high-bit-width outliers. To address these issues, we propose a quantization algorithm and hardware architecture co-design for efficient LLM acceleration. Algorithmically, a grouped adaptive two-range quantization (ATRQ) scheme with an in-group embedded identifier is proposed to encode outliers and normal values in distinct ranges, achieving hardware-friendly aligned memory access and reducing quantization errors. On the hardware side, we develop a low-overhead ATRQ decoder and an outlier-bit-split processing element (PE) that reduces the hardware overhead of high-bit-width outliers by exploiting their inherent sparsity. To support mixed-precision computation and accommodate the distinct dataflows of the prefilling and decoding phases, we design a reconfigurable local accumulator that mitigates the overhead of additional adders. Experimental results show that the ATRQ-based accelerator outperforms existing solutions, achieving up to $2.48\times$ speedup and $2.01\times$ energy reduction in the LLM prefilling phase, and $1.87\times$ speedup and $2.03\times$ energy reduction in the decoding phase, while delivering superior model performance under post-training quantization.
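The abstract describes ATRQ only at a high level. The sketch below is a minimal NumPy model of one plausible reading: each group gets two symmetric quantization ranges, a narrow one for normal values and a wide one for outliers, plus a per-group identifier telling the decoder which range each entry uses. The threshold rule, bit widths, and separate-bitmask identifier here are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def atrq_quantize_group(group, normal_bits=4, outlier_bits=8, outlier_frac=0.03):
    """Two-range quantization sketch for one group: a narrow range for
    normal values and a wider range for outliers. Threshold choice, bit
    widths, and identifier layout are illustrative, not the paper's ATRQ."""
    # Pick a magnitude threshold so roughly `outlier_frac` of entries are outliers.
    thresh = float(np.quantile(np.abs(group), 1.0 - outlier_frac))
    is_outlier = np.abs(group) > thresh

    # Separate scales for the two symmetric uniform ranges.
    normal_scale = max(thresh, 1e-8) / (2 ** (normal_bits - 1) - 1)
    outlier_scale = max(float(np.abs(group).max()), 1e-8) / (2 ** (outlier_bits - 1) - 1)

    # Quantize each entry against the scale of the range it falls in.
    q = np.where(is_outlier,
                 np.round(group / outlier_scale),
                 np.round(group / normal_scale)).astype(np.int32)

    # A per-group identifier marks which entries use the outlier range,
    # so the decoder can pick the right scale for each entry.
    identifier = np.packbits(is_outlier.astype(np.uint8))
    return q, identifier, normal_scale, outlier_scale

def atrq_dequantize_group(q, identifier, normal_scale, outlier_scale):
    """Reverse the two-range mapping using the per-group identifier."""
    is_outlier = np.unpackbits(identifier)[:q.size].astype(bool)
    return np.where(is_outlier, q * outlier_scale, q * normal_scale)

# Usage: quantize a group of 32 values containing one outlier, then reconstruct.
rng = np.random.default_rng(0)
g = rng.standard_normal(32).astype(np.float32)
g[3] = 9.0  # inject an outlier
q, ident, s_n, s_o = atrq_quantize_group(g)
g_hat = atrq_dequantize_group(q, ident, s_n, s_o)
print("max reconstruction error:", np.max(np.abs(g - g_hat)))
```

In the paper the identifier is embedded in-group rather than stored as a side bitmask, which is what keeps groups fixed-size and memory accesses aligned; the sketch separates it only for readability.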
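The outlier-bit-split PE can likewise be sketched at behavior level: a high-bit-width outlier is split into low-bit slices so the narrow multiplier used for normal values can be reused, with the high slice's partial product shifted before accumulation. The slice widths and function below are illustrative assumptions, not the paper's datapath.

```python
def bit_split_mac(outlier_q, activation, pe_bits=4):
    """Behavioral model of an outlier-bit-split multiply: an 8-bit outlier
    weight is split into two 4-bit slices, each multiplied on a narrow
    unit, with the high slice's partial product shifted before the add."""
    sign = -1 if outlier_q < 0 else 1
    mag = abs(outlier_q)
    lo = mag & ((1 << pe_bits) - 1)   # low slice
    hi = mag >> pe_bits               # high slice
    # Two narrow multiplies replace one wide multiply:
    # mag * act == ((hi * act) << pe_bits) + lo * act
    return sign * (((hi * activation) << pe_bits) + (lo * activation))

assert bit_split_mac(-77, 13) == -77 * 13
```

Because outliers are sparse, the extra high-slice pass is needed only rarely, which is the sparsity the abstract says the PE exploits.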
Source Journal

CiteScore: 8.50
Self-citation rate: 2.20%
Publication volume: 86
Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.