Title: Fully Quantized Matrix Arithmetic-Only BERT Model and Its FPGA-Based Accelerator
Authors: Hiroshi Fuketa; Toshihiro Katashita; Yohei Hori; Masakazu Hioki
Journal: IEEE Access, vol. 13, pp. 107165-107174
Publication date: 2025-06-20
Publication type: Journal Article
DOI: 10.1109/ACCESS.2025.3581957
URL: https://ieeexplore.ieee.org/document/11045889/
Impact factor: 3.4 (JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS)
Citations: 0
Abstract
In this paper, we propose a fully quantized matrix arithmetic-only BERT (FQ MA-BERT) model to enable efficient natural language processing. Conventionally, the BERT model relies on floating-point arithmetic for inference and requires not only linear matrix multiplication but also nonlinear functions, such as softmax and normalization functions, which significantly increase hardware costs. In contrast, the proposed FQ MA-BERT model quantizes all activations to 8-bit integer precision and all weights to ternary precision, i.e., the proposed model is fully quantized. Moreover, all nonlinear layers are replaced with matrix arithmetic operations. The proposed model therefore consists solely of integer-precision linear matrix arithmetic operations, allowing a significant reduction in hardware resources. To validate the hardware-friendliness of the proposed FQ MA-BERT model, we implement an accelerator for it on an AMD Zynq UltraScale+ FPGA. The proposed accelerator comprises only two types of integer-precision matrix multiplication units and needs no specialized circuits to perform nonlinear functions. The implementation results show that the hardware resource efficiency of the proposed accelerator is 1.8 times higher than that of conventional FPGA-based BERT accelerators.
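To make the combination of 8-bit integer activations and ternary weights concrete, the following is a minimal sketch of an integer-only matrix-vector product. It is not the authors' method: the abstract does not state the quantization rules, so the symmetric per-tensor scaling in `quantize_int8`, the mean-magnitude threshold in `ternarize` (including the hypothetical `threshold_ratio` parameter), and all function names are assumptions chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to [-127, 127] (assumed scheme)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def ternarize(weights, threshold_ratio=0.7):
    """Map each float weight to -1, 0, or +1.

    The threshold (a fraction of the mean absolute weight) is an assumption,
    not a rule taken from the paper.
    """
    flat = [abs(w) for row in weights for w in row]
    threshold = threshold_ratio * sum(flat) / len(flat)
    return [[int(w > threshold) - int(w < -threshold) for w in row]
            for row in weights]


def ternary_matvec(t_weights, q_acts):
    """Integer-only product: ternary weights reduce each term to add, subtract, or skip."""
    out = []
    for row in t_weights:
        acc = 0
        for t, a in zip(row, q_acts):
            if t == 1:
                acc += a
            elif t == -1:
                acc -= a
        out.append(acc)
    return out


# Illustrative values only; they are not taken from the paper.
acts = [0.5, -1.2, 0.25, 2.0]
weights = [[0.9, -0.1, 0.05, -0.8],
           [-0.7, 0.6, 0.02, 0.1]]

q_acts, act_scale = quantize_int8(acts)
t_weights = ternarize(weights)
int_out = ternary_matvec(t_weights, q_acts)

print(q_acts)     # [32, -76, 16, 127]
print(t_weights)  # [[1, 0, 0, -1], [-1, 1, 0, 0]]
print(int_out)    # [-95, -108]
```

Because every weight is -1, 0, or +1, the inner loop needs no multiplier at all, which is one plausible reason such a model maps onto compact integer matrix units in hardware.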
IEEE Access
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
Journal description:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.