Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou
Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, published 2024-08-02
DOI: 10.1109/JETCAS.2024.3437408
https://ieeexplore.ieee.org/document/10621065/
Abstract
The rapid development of convolutional neural networks (CNNs) benefits greatly from hardware acceleration, which maintains low latency and high resource utilization. To improve the processing efficiency of CNN algorithms, Field-Programmable Gate Array (FPGA)-based accelerators are designed with more hardware resources to achieve high parallelism and throughput. However, bottlenecks arise when more processing elements (PEs) are introduced in the form of PE clusters: 1) under-utilization of the FPGA's fixed hardware resources, which causes a mismatch between effective and peak performance; and 2) limited clock frequency caused by complex placement and routing. In this paper, a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator is proposed. At the upper level, a mesh-based NoC interconnects multiple PE clusters. This design not only provides the flexibility to balance different data communication models for better PE utilization and energy efficiency, but also enables a globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster that maximizes data reuse by exploiting the inherent dataflow of convolutional networks. Implementation and experiments on a Xilinx ZU9EG FPGA with 4 benchmark CNN models (ResNet50, ResNet34, VGG16, and Darknet19) show that our work operates at 300 MHz and delivers effective throughputs of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS, corresponding to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization, respectively. Compared with related FPGA-based designs, our work improves the DSP resource efficiency by $5.36\times$, $1.62\times$, $1.96\times$, and $5.83\times$, respectively.
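The throughput and utilization figures quoted in the abstract can be cross-checked with a short calculation. The sketch below assumes that PE utilization is defined as effective throughput divided by peak throughput; the abstract does not state the definition, so the function names and the derived peak value are illustrative rather than taken from the paper.

```python
# Figures quoted in the abstract: (effective throughput in TOPS, PE utilization in %).
REPORTED = {
    "ResNet50": (0.998, 92.85),
    "ResNet34": (1.022, 95.10),
    "VGG16": (1.024, 95.25),
    "Darknet19": (1.026, 95.46),
}
CLOCK_HZ = 300e6  # operating frequency stated in the abstract


def implied_peak_tops(effective_tops: float, utilization_pct: float) -> float:
    """Peak throughput implied by the ASSUMED relation utilization = effective / peak."""
    return effective_tops / (utilization_pct / 100.0)


def ops_per_cycle(peak_tops: float, clock_hz: float = CLOCK_HZ) -> float:
    """Operations per clock cycle needed to reach the given peak throughput."""
    return peak_tops * 1e12 / clock_hz


if __name__ == "__main__":
    for model, (tops, util) in REPORTED.items():
        peak = implied_peak_tops(tops, util)
        print(f"{model}: implied peak {peak:.3f} TOPS, "
              f"{ops_per_cycle(peak):.0f} ops/cycle")
```

Under this assumption, all four models imply the same peak of roughly 1.075 TOPS, or about 3583 operations per cycle at 300 MHz, which suggests the reported figures are mutually consistent. How those operations map onto individual PEs or DSP slices is not stated in the abstract.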
About the journal:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.