FPGA Acceleration With Hessian-Based Comprehensive Intra-Layer Mixed-Precision Quantization for Transformer Models

Impact factor 3.4; Zone 3, Computer Science; JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access; Pub Date: 2025-04-22; DOI: 10.1109/ACCESS.2025.3563196

Woohong Byun;Jongseok Woo;Saibal Mukhopadhyay

Volume 13, pages 70282-70297. Full text: https://ieeexplore.ieee.org/document/10973048/

Citations: 0

Abstract

Recent advancements in using FPGAs as co-processors for language model acceleration, particularly for energy efficiency and flexibility, face challenges due to limited memory capacity. This limitation hinders the deployment of transformer-based language models. To address this challenge, we propose a novel software-hardware co-optimization framework that integrates Hessian-based intra-layer mixed-precision quantization with a runtime bit-configurable FPGA accelerator. Our proposed Hessian-based row-wise weight quantization addresses hardware inefficiencies in traditional parameter-wise and channel-wise approaches by enabling mixed-precision weight matrices to be split into two uniform-precision matrices, thereby simplifying hardware requirements. Additionally, our Query-Key coupled attention activation quantization optimally aligns precision within each outer product pair in attention calculations, reducing hardware complexity and memory management overhead. Our concurrent quantization method balances the benefits of row-wise weight quantization and Query-Key coupled activation quantization while maximizing energy efficiency through multi-precision optimization. To support this algorithm, we design a multi-precision FPGA accelerator capable of handling both $2^n$-based and non-$2^n$ mixed-precision operations. It is implemented on a single Xilinx ZCU102 FPGA board, operating at 200 MHz with a power consumption of 15.08 W during inference on the 110-million-parameter BERT-Base and 345-million-parameter GPT-2 Medium transformer models. Coupled with the proposed algorithm and dataflow optimization, it enables on-chip storage of all necessary parameters, minimizing off-chip memory access. Experimental results demonstrate that our FPGA accelerator significantly outperforms existing solutions, achieving energy efficiency improvements ranging from $2.22\times$ to $17.23\times$ over state-of-the-art FPGA accelerators.
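The row-wise splitting idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the Hessian metric, the bit widths, or the row-selection rule, so `row_sensitivity` (a stand-in for a per-row Hessian-based score), `high_fraction`, and the 8-bit/4-bit pairing are all hypothetical. The sketch only shows why per-row precision lets a mixed-precision matrix be stored as two uniform-precision matrices whose outputs are scattered back by row index.

```python
import numpy as np


def quantize_rows(w, bits):
    """Symmetric uniform quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def split_by_sensitivity(w, row_sensitivity, high_fraction=0.25,
                         high_bits=8, low_bits=4):
    """Split w into a high-precision and a low-precision row group.

    row_sensitivity is a hypothetical per-row score; the most sensitive
    rows keep the higher bit width (assumed selection rule).
    """
    n_high = int(round(high_fraction * w.shape[0]))
    order = np.argsort(-row_sensitivity, kind="stable")  # most sensitive first
    high_rows = np.sort(order[:n_high])
    low_rows = np.sort(order[n_high:])
    q_hi, s_hi = quantize_rows(w[high_rows], high_bits)
    q_lo, s_lo = quantize_rows(w[low_rows], low_bits)
    return (high_rows, q_hi, s_hi), (low_rows, q_lo, s_lo)


def split_matvec(parts, x, n_rows):
    """Compute y = W x as two uniform-precision products, then scatter rows."""
    y = np.zeros(n_rows, dtype=np.float64)
    for rows, q, scale in parts:
        y[rows] = (q @ x) * scale[:, 0]  # integer weights, per-row rescale
    return y


# Toy example with made-up data and made-up sensitivity scores.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
sens = np.arange(8, dtype=float)  # hypothetical: later rows are more sensitive

hi, lo = split_by_sensitivity(w, sens)
y = split_matvec([hi, lo], x, w.shape[0])
```

Because each output element depends on exactly one weight row, the two sub-matrices can be processed independently at their own bit width, and no single product has to mix precisions.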


Source journal: IEEE Access

Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore: 9.80

Self-citation rate: 7.70%

Articles published: 6673

Review time: 6 weeks

About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.