Communication-Aware and Resource-Efficient NoC-Based Architecture for CNN Acceleration

IF 3.7; Zone 2, Engineering Technology; Q2; ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Journal on Emerging and Selected Topics in Circuits and Systems Pub Date : 2024-08-02 DOI:10.1109/JETCAS.2024.3437408

Huidong Ji;Chen Ding;Boming Huang;Yuxiang Huan;Li-Rong Zheng;Zhuo Zou

Volume 14, Issue 3, Pages 440-454. Journal Article. https://ieeexplore.ieee.org/document/10621065/

Citations: 0

Abstract

The rapid development of convolutional neural networks (CNNs) benefits greatly from hardware acceleration, which maintains low latency and high resource utilization. To improve the processing efficiency of CNN algorithms, Field Programmable Gate Array (FPGA)-based accelerators are designed with more hardware resources to achieve high parallelism and throughput. However, bottlenecks arise as more processing elements (PEs), organized as PE clusters, are introduced: 1) under-utilization of the FPGA's fixed hardware resources, which creates a mismatch between effective and peak performance; and 2) limited clock frequency caused by complex placement and routing. This paper proposes a 2-level hierarchical Network-on-Chip (NoC)-based CNN accelerator. At the upper level, a mesh-based NoC interconnects multiple PE clusters. This design not only provides the flexibility to balance different data communication models for better PE utilization and energy efficiency, but also enables a globally asynchronous, locally synchronous (GALS) architecture for better timing closure. At the lower level, local PEs are organized into a 3D-tiled PE cluster that maximizes data reuse by exploiting the inherent dataflow of convolutional networks. Implementation and experiments on a Xilinx ZU9EG FPGA with four benchmark CNN models (ResNet50, ResNet34, VGG16, and Darknet19) show that our work operates at 300 MHz and delivers effective throughputs of 0.998 TOPS, 1.022 TOPS, 1.024 TOPS, and 1.026 TOPS, respectively, corresponding to 92.85%, 95.1%, 95.25%, and 95.46% PE utilization. Compared with related FPGA-based designs, our work improves DSP resource efficiency by $5.36\times$, $1.62\times$, $1.96\times$, and $5.83\times$, respectively.
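The throughput and utilization figures in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that PE utilization is the ratio of effective to peak throughput and that one multiply-accumulate (MAC) counts as two operations. Neither definition is stated in the abstract, so the derived peak throughput and MAC count are illustrative estimates, not figures reported by the authors. The names `REPORTED`, `implied_peak_tops`, and `implied_macs_per_cycle` are hypothetical.

```python
# Figures quoted in the abstract: (effective throughput in TOPS, PE utilization).
REPORTED = {
    "ResNet50": (0.998, 0.9285),
    "ResNet34": (1.022, 0.9510),
    "VGG16": (1.024, 0.9525),
    "Darknet19": (1.026, 0.9546),
}

CLOCK_HZ = 300e6   # operating frequency stated in the abstract
OPS_PER_MAC = 2    # ASSUMPTION: one MAC = one multiply + one add


def implied_peak_tops(effective_tops: float, utilization: float) -> float:
    """Peak throughput, ASSUMING utilization = effective / peak."""
    return effective_tops / utilization


def implied_macs_per_cycle(peak_tops: float,
                           clock_hz: float = CLOCK_HZ,
                           ops_per_mac: int = OPS_PER_MAC) -> float:
    """Number of MACs per clock cycle needed to sustain the given peak."""
    return peak_tops * 1e12 / (clock_hz * ops_per_mac)


if __name__ == "__main__":
    for model, (effective, utilization) in REPORTED.items():
        peak = implied_peak_tops(effective, utilization)
        macs = implied_macs_per_cycle(peak)
        print(f"{model:10s} implied peak {peak:.3f} TOPS, "
              f"about {macs:.0f} MACs per cycle")
```

Under these assumptions, all four models imply nearly the same peak of roughly 1.075 TOPS, which suggests the reported figures are mutually consistent and that the differences in effective throughput come from utilization alone.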

Source journal

IEEE Journal on Emerging and Selected Topics in Circuits and Systems (ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore

8.50

Self-citation rate

2.20%

Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.