Xianghong Hu, Shansen Fu, Yuanmiao Lin, Xueming Li, Chaoming Yang, Rongfeng Li, Hongmin Huang, Shuting Cai, Xiaoming Xiong
{"title":"An FPGA-based bit-level weight sparsity and mixed-bit accelerator for neural networks","authors":"Xianghong Hu , Shansen Fu , Yuanmiao Lin , Xueming Li , Chaoming Yang , Rongfeng Li , Hongmin Huang , Shuting Cai , Xiaoming Xiong","doi":"10.1016/j.sysarc.2025.103463","DOIUrl":null,"url":null,"abstract":"<div><div>Bit-level weight sparsity and mixed-bit quantization are regarded as effective methods to improve the computing efficiency of convolutional neural network (CNN) accelerators. However, irregular sparse matrices will greatly increase the index overhead and hardware resource consumption. Moreover, bit-serial computing (BSC) is usually adopted to implement bit-level weight sparsity on accelerators, and the traditional BSC leads to uneven utilization of DSP and LUT resources on the FPGA platform, thereby limiting the improvement of the overall performance of the accelerator. Therefore, in this work, we present an accelerator designed for bit-level weight sparsity and mixed-bit quantization. We first introduce a non-linear quantization algorithm named bit-level sparsity learned quantizer (BSLQ), which can maintain high accuracy during mixed quantization and guide the accelerator to complete bit-level weight sparse computations using DSP. Based on this algorithm, we implement the multi-channel bit-level sparsity (MCBS) method to mitigate irregularities and reduce the index count associated with bit-level sparsity. Finally, we propose a sparse weight arbitrary basis scratch pad (SWAB SPad) method that enables retrieval of compressed weights without fetching activations, which can save 30.52% of LUTs and 64.02% of FFs. Experimental results demonstrate that when quantizing ResNet50 and VGG16 using 4/8 bits, our approach achieves accuracy that is comparable to or even better than 32-bit (75.98% and 73.70% for the two models). Compared to the state-of-the-art FPGA-based accelerators, this accelerator achieves up to 5.36 times DSP efficiency improvement and provides 8.87 times energy efficiency improvement.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"166 ","pages":"Article 103463"},"PeriodicalIF":4.1000,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125001353","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Bit-level weight sparsity and mixed-bit quantization are regarded as effective methods for improving the computing efficiency of convolutional neural network (CNN) accelerators. However, irregular sparse matrices greatly increase index overhead and hardware resource consumption. Moreover, bit-serial computing (BSC) is usually adopted to implement bit-level weight sparsity on accelerators, and traditional BSC leads to uneven utilization of DSP and LUT resources on FPGA platforms, thereby limiting the accelerator's overall performance. In this work, we therefore present an accelerator designed for bit-level weight sparsity and mixed-bit quantization. We first introduce a non-linear quantization algorithm, the bit-level sparsity learned quantizer (BSLQ), which maintains high accuracy under mixed quantization and guides the accelerator to complete bit-level weight-sparse computations using DSPs. Based on this algorithm, we implement the multi-channel bit-level sparsity (MCBS) method to mitigate irregularity and reduce the index count associated with bit-level sparsity. Finally, we propose a sparse weight arbitrary basis scratch pad (SWAB SPad) method that retrieves compressed weights without fetching activations, saving 30.52% of LUTs and 64.02% of FFs. Experimental results demonstrate that when quantizing ResNet50 and VGG16 with 4/8 bits, our approach achieves accuracy comparable to or even better than 32-bit (75.98% and 73.70% for the two models, respectively). Compared with state-of-the-art FPGA-based accelerators, this accelerator achieves up to a 5.36× improvement in DSP efficiency and an 8.87× improvement in energy efficiency.
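To make the underlying idea concrete, the following is a minimal Python sketch of bit-level weight sparsity in its generic form: each integer weight is decomposed into its nonzero bit positions, so a multiplication becomes a shift-and-add over set bits only, and zero bits cost nothing. This is an illustrative assumption-laden sketch of the general principle, not the paper's BSLQ/MCBS hardware design; all function names here are hypothetical.

```python
# Illustrative sketch only: generic bit-level weight sparsity, not the
# paper's BSLQ/MCBS accelerator design. All names are hypothetical.

def nonzero_bit_positions(weight: int) -> list[int]:
    """Return the positions of set bits in a non-negative integer weight."""
    positions = []
    bit = 0
    w = weight
    while w:
        if w & 1:
            positions.append(bit)
        w >>= 1
        bit += 1
    return positions

def bit_sparse_dot(activations: list[int], weights: list[int]) -> int:
    """Dot product where each multiply is replaced by shift-and-add over
    the weight's nonzero bits; zero bits (and zero weights) are skipped."""
    acc = 0
    for a, w in zip(activations, weights):
        sign = -1 if w < 0 else 1
        for pos in nonzero_bit_positions(abs(w)):
            acc += sign * (a << pos)
    return acc

if __name__ == "__main__":
    acts = [3, -2, 5, 7]
    wts = [6, 0, -3, 8]   # the zero weight contributes no bit operations
    assert bit_sparse_dot(acts, wts) == sum(a * w for a, w in zip(acts, wts))
    print(bit_sparse_dot(acts, wts))  # 59
```

In hardware, the benefit comes from skipping the zero bits entirely; the index structures the abstract mentions (and the MCBS method) address the bookkeeping needed to know which bit positions survive after sparsification.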
Journal Introduction:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.