Rethinking the Designing of Convolution Engine for Reconfigurable CNN Accelerator Using Sparse-Based Design Scheme

IF 5.2 | Zone 1 (Engineering & Technology) | JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Circuits and Systems I: Regular Papers | Pub Date: 2025-04-08 | DOI: 10.1109/TCSI.2025.3554332

Yishuo Meng; Jianfei Wang; Siwei Xiang; Jia Hou; Zhijie Lin; Kuizhi Mei; Chen Yang

Vol. 72, No. 8, pp. 3983-3996. https://ieeexplore.ieee.org/document/10950430/

Citations: 0

Abstract


Rethinking the Designing of Convolution Engine for Reconfigurable CNN Accelerator Using Sparse-Based Design Scheme

Convolutional neural networks (CNNs) continue to evolve as they are applied to more diverse environments and more difficult tasks. This evolution has introduced a variety of convolution modes into current CNNs (e.g., $1\times 1$ convolution, stride-2 convolution, and rectangular convolution), which makes it difficult for hardware accelerators to support all of them efficiently. This paper finds that an important difference among these convolution modes is their computation density. The modes are therefore regarded as structured sparsity, and the paper argues that a sparse-based design methodology can be applied to implement a reconfigurable CNN accelerator. Two critical architectural parameters, the input tile size and the convolution engine (CE) scale, are then evaluated based on the standard deviation of calculations (SDC), unsupported convolution mode (UCM) and unsuitable IFM size (UIS), DSP utilization ratio (DUR), and hardware resource overhead (HRO), respectively. With the resulting optimal parameters, a highly parallel, flexible CE array and a high-performance, reconfigurable CNN architecture are designed. The accelerator was implemented on a Xilinx VC709 FPGA running at 300 MHz and achieves 921.60 to 1382.40 GOPS while supporting various convolution modes. Compared with previous dense- and sparse-based works, the proposed accelerator achieves $1.35\times$ to $10.77\times$ higher performance and $1.22\times$ to $2.84\times$ higher DSP efficiency when deploying VGG16.
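The two quantitative ideas in the abstract can be illustrated with a short sketch. It is not the paper's method: `relative_density` is a hypothetical metric, assumed here to be the multiply-accumulate (MAC) count of a convolution mode relative to a dense 3×3, stride-1 convolution over the same input tile (borders ignored), and `ops_per_cycle` simply divides the reported throughput by the reported clock frequency.

```python
from fractions import Fraction


def relative_density(kernel_h, kernel_w, stride, ref_kernel=3, ref_stride=1):
    """MAC count of a convolution mode relative to a dense reference mode.

    Hypothetical metric (an assumption, not the paper's definition):
    per input position, a mode performs kernel_h * kernel_w MACs for each
    output it produces, and it produces 1 / stride**2 outputs.
    """
    macs = Fraction(kernel_h * kernel_w, stride * stride)
    ref = Fraction(ref_kernel * ref_kernel, ref_stride * ref_stride)
    return macs / ref


def ops_per_cycle(gops, clock_mhz):
    """Operations per clock cycle implied by a GOPS figure at a given clock."""
    # GOPS * 1e9 / (MHz * 1e6) == GOPS * 1000 / MHz, kept exact with Fraction.
    return Fraction(str(gops)) * 1000 / Fraction(str(clock_mhz))


if __name__ == "__main__":
    # Modes named in the abstract, with illustrative (assumed) shapes.
    modes = {
        "3x3, stride 1 (dense reference)": (3, 3, 1),
        "1x1, stride 1": (1, 1, 1),
        "3x3, stride 2": (3, 3, 2),
        "1x3 rectangular, stride 1": (1, 3, 1),
    }
    for name, (kh, kw, s) in modes.items():
        print(f"{name}: density = {relative_density(kh, kw, s)}")

    # Throughput range and clock frequency as reported in the abstract.
    for gops in (921.60, 1382.40):
        print(f"{gops} GOPS at 300 MHz = {ops_per_cycle(gops, 300)} ops/cycle")
```

Under this assumed metric, the 1×1 and stride-2 modes use 1/9 and 1/4 of the dense MACs, which is the sense in which they can be treated as structured-sparse cases of one dense mode. The reported 921.60 and 1382.40 GOPS at 300 MHz correspond to 3072 and 4608 operations per cycle; if one MAC is counted as two operations (a common convention, assumed here), that is 1536 and 2304 MACs per cycle.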


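Source journal

IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.80

Self-citation rate: 11.80%

Articles published: 441

Review time: 2 months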

About the journal: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems