{"title":"Low-Bit Mixed-Precision Quantization and Acceleration of CNN for FPGA Deployment","authors":"JianRong Wang;Zhijun He;Hongbo Zhao;Rongke Liu","doi":"10.1109/TETCI.2024.3510295","DOIUrl":null,"url":null,"abstract":"Nowadays, the deployment of intelligent networks on hardware devices for real-time applications is gaining popularity in both academia and industry. However, on-chip resources and power consumption are usually limited, making quantization a crucial step due to its ability to reduce the computational footprint. To this point, mixed-precision bit-width allocation for weights is an effective way to reduce the overall memory footprint while maximizing model accuracy, which can generally be divided into two schemes: per-layer quantization and per-channel quantization. However, the latter has a large searching space, making it hard to obtain optimal solutions, so currently most research focuses on the former scheme. Additionally, there is almost no research targeting the design and optimization of FPGA accelerator structures for per-channel quantization. Motivated by these considerations, this paper first proposes a mixed-precision bit allocation method, called Hierarchical Bit Programming (HBP), which reduces the magnitude of the search space by applying group optimization on channel dimension and consequently reduce the computational complexity of the solving process. Then a loop optimization strategy is presented based on quantization manner, and models are established to evaluate FPGA performance and resource requirement, enabling the evaluation and analysis of accelerator performance bottlenecks and optimization boundaries in the early phase of system design. Based on the optimization results, a hardware accelerator design structure is presented. Several mainstream CNN models are used for evaluation, and on-board tests are conducted on the Zynq MPSoC XCZU15EG FPGA platform. The experiment results show that our HBP method could achieve an improvement of more than 2% on accuracy compared with other related methods. Compared with CPU and GPU, the proposed FPGA accelerator yields speedups of 28.8%, 46.2%, 31.0%, and 35.9% in energy efficiency on VGG-16, ResNet18, ResNet34, and ResNet50, respectively, and the processing latency could be 25% lower than state-of-the-art methods.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"9 3","pages":"2597-2617"},"PeriodicalIF":5.3000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10806654/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Nowadays, the deployment of intelligent networks on hardware devices for real-time applications is gaining popularity in both academia and industry. However, on-chip resources and power consumption are usually limited, making quantization a crucial step due to its ability to reduce the computational footprint. To this end, mixed-precision bit-width allocation for weights is an effective way to reduce the overall memory footprint while maximizing model accuracy; it can generally be divided into two schemes: per-layer quantization and per-channel quantization. The latter has a much larger search space, making optimal solutions hard to obtain, so most existing research focuses on the former. In addition, there is almost no research targeting the design and optimization of FPGA accelerator structures for per-channel quantization. Motivated by these considerations, this paper first proposes a mixed-precision bit allocation method, called Hierarchical Bit Programming (HBP), which reduces the size of the search space by applying group optimization along the channel dimension and consequently reduces the computational complexity of the solving process. A loop optimization strategy is then presented based on the quantization scheme, and models are established to evaluate FPGA performance and resource requirements, enabling the evaluation and analysis of accelerator performance bottlenecks and optimization boundaries in the early phase of system design. Based on the optimization results, a hardware accelerator architecture is presented. Several mainstream CNN models are used for evaluation, and on-board tests are conducted on the Zynq MPSoC XCZU15EG FPGA platform. The experimental results show that the proposed HBP method achieves an accuracy improvement of more than 2% over related methods. Compared with CPU and GPU implementations, the proposed FPGA accelerator improves energy efficiency by 28.8%, 46.2%, 31.0%, and 35.9% on VGG-16, ResNet18, ResNet34, and ResNet50, respectively, and its processing latency can be 25% lower than that of state-of-the-art methods.
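To make the idea of group optimization along the channel dimension concrete, the following is a minimal, hypothetical sketch of grouped per-channel weight quantization. It is not the authors' HBP implementation; the function names and the grouping criterion (ordering channels by dynamic range) are illustrative assumptions only. The point it shows is how grouping shrinks the bit-width search space from one decision per channel to one decision per group.

```python
# Hypothetical sketch: grouped per-channel mixed-precision quantization.
# Not the paper's HBP algorithm; grouping by weight range is an assumption.
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a weight tensor to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def grouped_channel_quantize(weights, num_groups, group_bits):
    """Split output channels into groups and assign one bit-width per group.

    weights    : array of shape (out_channels, in_channels, kH, kW)
    num_groups : number of channel groups (search space shrinks from
                 per-channel to per-group)
    group_bits : list of bit-widths, one per group
    """
    out_channels = weights.shape[0]
    # Illustrative grouping: order channels by dynamic range so channels
    # with similar ranges share a bit-width.
    ranges = np.abs(weights).reshape(out_channels, -1).max(axis=1)
    order = np.argsort(ranges)
    groups = np.array_split(order, num_groups)

    quantized = np.empty_like(weights)
    for group, bits in zip(groups, group_bits):
        for c in group:
            quantized[c] = quantize_symmetric(weights[c], bits)
    return quantized

# Example: 64 output channels, 4 groups -> 4 bit-width decisions instead of 64.
w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_q = grouped_channel_quantize(w, num_groups=4, group_bits=[2, 3, 4, 8])
```

With B candidate bit-widths, a per-channel search over C channels has B^C configurations, whereas a per-group search over G groups has only B^G, which is the kind of search-space reduction the abstract attributes to HBP.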
Journal Introduction
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication. TETCI publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.