{"title":"Low-Bit Mixed-Precision Quantization and Acceleration of CNN for FPGA Deployment","authors":"JianRong Wang;Zhijun He;Hongbo Zhao;Rongke Liu","doi":"10.1109/TETCI.2024.3510295","DOIUrl":null,"url":null,"abstract":"Nowadays, the deployment of intelligent networks on hardware devices for real-time applications is gaining popularity in both academia and industry. However, on-chip resources and power consumption are usually limited, making quantization a crucial step due to its ability to reduce the computational footprint. To this point, mixed-precision bit-width allocation for weights is an effective way to reduce the overall memory footprint while maximizing model accuracy, which can generally be divided into two schemes: per-layer quantization and per-channel quantization. However, the latter has a large searching space, making it hard to obtain optimal solutions, so currently most research focuses on the former scheme. Additionally, there is almost no research targeting the design and optimization of FPGA accelerator structures for per-channel quantization. Motivated by these considerations, this paper first proposes a mixed-precision bit allocation method, called Hierarchical Bit Programming (HBP), which reduces the magnitude of the search space by applying group optimization on channel dimension and consequently reduce the computational complexity of the solving process. Then a loop optimization strategy is presented based on quantization manner, and models are established to evaluate FPGA performance and resource requirement, enabling the evaluation and analysis of accelerator performance bottlenecks and optimization boundaries in the early phase of system design. Based on the optimization results, a hardware accelerator design structure is presented. Several mainstream CNN models are used for evaluation, and on-board tests are conducted on the Zynq MPSoC XCZU15EG FPGA platform. The experiment results show that our HBP method could achieve an improvement of more than 2% on accuracy compared with other related methods. Compared with CPU and GPU, the proposed FPGA accelerator yields speedups of 28.8%, 46.2%, 31.0%, and 35.9% in energy efficiency on VGG-16, ResNet18, ResNet34, and ResNet50, respectively, and the processing latency could be 25% lower than state-of-the-art methods.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 3","pages":"2597-2617"},"PeriodicalIF":5.3000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10806654/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Nowadays, the deployment of intelligent networks on hardware devices for real-time applications is gaining popularity in both academia and industry. However, on-chip resources and power consumption are usually limited, making quantization a crucial step due to its ability to reduce the computational footprint. To this end, mixed-precision bit-width allocation for weights is an effective way to reduce the overall memory footprint while maximizing model accuracy; it can generally be divided into two schemes: per-layer quantization and per-channel quantization. The latter has a much larger search space, making optimal solutions hard to obtain, so most existing research focuses on the former. In addition, there is almost no research targeting the design and optimization of FPGA accelerator structures for per-channel quantization. Motivated by these considerations, this paper first proposes a mixed-precision bit allocation method, called Hierarchical Bit Programming (HBP), which reduces the size of the search space by applying group optimization along the channel dimension and consequently reduces the computational complexity of the solving process. A loop optimization strategy is then presented based on the quantization scheme, and models are established to evaluate FPGA performance and resource requirements, enabling the evaluation and analysis of accelerator performance bottlenecks and optimization boundaries in the early phase of system design. Based on the optimization results, a hardware accelerator architecture is presented. Several mainstream CNN models are used for evaluation, and on-board tests are conducted on the Zynq MPSoC XCZU15EG FPGA platform. The experimental results show that the proposed HBP method achieves an accuracy improvement of more than 2% over related methods. Compared with CPU and GPU implementations, the proposed FPGA accelerator improves energy efficiency by 28.8%, 46.2%, 31.0%, and 35.9% on VGG-16, ResNet18, ResNet34, and ResNet50, respectively, and its processing latency can be 25% lower than that of state-of-the-art methods.
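To make the idea of group optimization along the channel dimension concrete, the following is a minimal, hypothetical sketch of grouped per-channel weight quantization. It is not the authors' HBP implementation; the function names and the grouping criterion (ordering channels by dynamic range) are illustrative assumptions only. The point it shows is how grouping shrinks the bit-width search space from one decision per channel to one decision per group.

```python
# Hypothetical sketch: grouped per-channel mixed-precision quantization.
# Not the paper's HBP algorithm; grouping by weight range is an assumption.
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def grouped_channel_quantize(weights, num_groups, group_bits):
    """Split output channels into groups and assign one bit-width per group.

    weights    : array of shape (out_channels, in_channels, kH, kW)
    num_groups : number of channel groups (search space shrinks from
                 per-channel to per-group)
    group_bits : list of bit-widths, one per group
    """
    out_channels = weights.shape[0]
    # Illustrative grouping: order channels by dynamic range so channels
    # with similar ranges share a bit-width.
    ranges = np.abs(weights).reshape(out_channels, -1).max(axis=1)
    order = np.argsort(ranges)
    groups = np.array_split(order, num_groups)

    quantized = np.empty_like(weights)
    for group, bits in zip(groups, group_bits):
        for c in group:
            quantized[c] = quantize_symmetric(weights[c], bits)
    return quantized

# Example: 64 output channels, 4 groups -> 4 bit-width decisions instead of 64.
w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_q = grouped_channel_quantize(w, num_groups=4, group_bits=[2, 3, 4, 8])
```

With B candidate bit-widths, a per-channel search over C channels has B^C configurations, whereas a per-group search over G groups has only B^G, which is the kind of search-space reduction the abstract attributes to HBP.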
Journal Introduction
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.