{"title":"FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs","authors":"Wenqi Lou;Yunji Qin;Xuan Wang;Lei Gong;Chao Wang;Xuehai Zhou","doi":"10.1109/TCAD.2024.3439488","DOIUrl":null,"url":null,"abstract":"Block-circulant matrix (BCM) compression has garnered much attention in the hardware acceleration of convolutional neural networks (CNNs) due to its regularity and efficiency. However, constrained by the difficulty of exploring the compression parameter space, existing BCM-based methods often apply a uniform compression parameter to all CNN models’ layers, losing the compression’s flexibility. Additionally, independently optimizing models or accelerators makes achieving the optimal tradeoff between model accuracy and hardware efficiency challenging. To this end, we propose FlexBCM, a joint exploration framework that efficiently explores both the parameter compression and hardware parameter space to generate customized hybrid BCM-compressed CNN and field-programmable gate array (FPGA) accelerator solutions. On the algorithmic side, leveraging the idea of neural architecture search (NAS), we design an efficient differentiable sampling method to rapidly evaluate the accuracy of candidate subnets. Additionally, we devise a hardware-friendly frequency domain quantization scheme for BCM computation. On the hardware side, we develop the efficient and parameter-configurable convolutional core (ConvPU) alongside the BCM computing core (BCMPU). The BCMPU can flexibly accommodate different compression parameters at runtime, incorporate complex-number DSP packing and conjugate symmetry optimizations. For model-to-hardware evaluation, we construct accurate latency and resource consumption models. Moreover, we design a fast hardware generation algorithm based on the coarse-grained search to provide prompt feedback on the hardware evaluation of the current subnet. Finally, we validate FlexBCM on the Xilinx ZCU102 FPGA and compare its compressed CNN-accelerator solutions with previous state-of-the-art works. Experimental results demonstrate that FlexBCM achieves 1.21–3.02 times higher-computational efficiency for ResNet18 and ResNet34 models while maintaining an acceptable accuracy loss on the ImageNet dataset.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3852-3863"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10745837/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Block-circulant matrix (BCM) compression has garnered much attention in the hardware acceleration of convolutional neural networks (CNNs) due to its regularity and efficiency. However, constrained by the difficulty of exploring the compression parameter space, existing BCM-based methods often apply a uniform compression parameter to all CNN models’ layers, losing the compression’s flexibility. Additionally, independently optimizing models or accelerators makes achieving the optimal tradeoff between model accuracy and hardware efficiency challenging. To this end, we propose FlexBCM, a joint exploration framework that efficiently explores both the parameter compression and hardware parameter space to generate customized hybrid BCM-compressed CNN and field-programmable gate array (FPGA) accelerator solutions. On the algorithmic side, leveraging the idea of neural architecture search (NAS), we design an efficient differentiable sampling method to rapidly evaluate the accuracy of candidate subnets. Additionally, we devise a hardware-friendly frequency domain quantization scheme for BCM computation. On the hardware side, we develop the efficient and parameter-configurable convolutional core (ConvPU) alongside the BCM computing core (BCMPU). The BCMPU can flexibly accommodate different compression parameters at runtime, incorporate complex-number DSP packing and conjugate symmetry optimizations. For model-to-hardware evaluation, we construct accurate latency and resource consumption models. Moreover, we design a fast hardware generation algorithm based on the coarse-grained search to provide prompt feedback on the hardware evaluation of the current subnet. Finally, we validate FlexBCM on the Xilinx ZCU102 FPGA and compare its compressed CNN-accelerator solutions with previous state-of-the-art works. Experimental results demonstrate that FlexBCM achieves 1.21–3.02 times higher-computational efficiency for ResNet18 and ResNet34 models while maintaining an acceptable accuracy loss on the ImageNet dataset.
期刊介绍:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.