Fully Quantized Matrix Arithmetic-Only BERT Model and Its FPGA-Based Accelerator

IF 3.4 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access Pub Date : 2025-06-20 DOI:10.1109/ACCESS.2025.3581957

Hiroshi Fuketa;Toshihiro Katashita;Yohei Hori;Masakazu Hioki

IEEE Access, vol. 13, pp. 107165-107174, 2025. https://ieeexplore.ieee.org/document/11045889/

Citations: 0

Abstract

In this paper, we propose a fully quantized matrix arithmetic-only BERT (FQ MA-BERT) model to enable efficient natural language processing. Conventionally, the BERT model relies on floating-point arithmetic for inference and requires not only linear matrix multiplication but also nonlinear functions, such as softmax and normalization functions, which significantly increase hardware costs. In contrast, the proposed FQ MA-BERT model quantizes all activations to 8-bit integer precision and all weights to ternary precision, i.e., the proposed model is fully quantized. Moreover, all nonlinear layers are replaced with matrix arithmetic operations. This means that the proposed model consists solely of integer-precision linear matrix arithmetic operations, allowing a significant reduction in hardware resources. To validate the hardware-friendliness of the proposed FQ MA-BERT model, we implement an accelerator for the proposed model on an AMD Zynq UltraScale+ FPGA. The proposed accelerator comprises only two types of integer-precision matrix multiplication units without the need for specialized circuits to perform nonlinear functions. The implementation results show that the hardware resource efficiency of the proposed accelerator is 1.8 times higher than that of conventional FPGA-based BERT accelerators.
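The abstract does not spell out the quantization scheme, so the following is only a rough sketch of what multiplying 8-bit integer activations by ternary weights can look like. The symmetric max-abs scaling in `quantize_int8`, the threshold rule in `ternarize` (0.7 times the mean absolute weight), and all function names are assumptions made for illustration, not the method of the paper. The sketch also does not cover how the nonlinear layers are replaced.

```python
# Illustrative sketch only: the scaling and thresholding rules below are
# assumptions, not the scheme used in the paper.

def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def ternarize(weights, ratio=0.7):
    """Map each weight to -1, 0, or +1 using a threshold of ratio * mean(|w|)."""
    flat = [abs(w) for row in weights for w in row]
    threshold = ratio * sum(flat) / len(flat)
    return [[(w > threshold) - (w < -threshold) for w in row] for row in weights]


def ternary_matvec(ternary_weights, activations):
    """Integer-only matrix-vector product: ternary weights need only add/subtract."""
    out = []
    for row in ternary_weights:
        acc = 0
        for w, a in zip(row, activations):
            if w == 1:
                acc += a
            elif w == -1:
                acc -= a
        out.append(acc)
    return out


# Hypothetical inputs, chosen only to exercise the functions.
activations, act_scale = quantize_int8([0.50, -1.27, 0.25, 1.00])
weights = ternarize([[0.9, -0.8, 0.05, 0.7], [-0.6, 0.02, 0.75, -0.9]])
result = ternary_matvec(weights, activations)
```

Because every weight is -1, 0, or +1, the inner loop needs no multiplier at all, which is one plausible reason such a model maps well onto small integer matrix units. Recovering real-valued outputs would additionally require the activation scale and a weight scale, which this sketch leaves out.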

Source journal

IEEE Access | COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

9.80

Self-citation rate

7.70%

Articles published

6673

Review time

6 weeks

About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.